Background

The file ‘WeatherEDA_202005_v002.Rmd’ contains exploratory data analysis of historical weather data from the METAR archives hosted by Iowa State University.

Data have been downloaded, processed, cleaned, and integrated for several stations (airports) and years, with .rds files saved in “./RInputFiles/ProcessedMETAR”.

This module will perform initial modeling on the processed weather files.

Data Availability

There are three main processed files available for further exploration:

metar_postEDA_20200617.rds

  • source (chr) - the reporting station and year
  • locale (chr) - the descriptive name for source
  • dtime (dttm) - the date-time for the observation
  • origMETAR (chr) - the original METAR associated with the observation at that source and date-time
  • year (dbl) - the year, extracted from dtime
  • monthint (dbl) - the month, extracted from dtime, as an integer
  • month (fct) - the month, extracted from dtime, as a three-character abbreviation (factor)
  • day (int) - the day of the month, extracted from dtime
  • WindDir (chr) - prevailing wind direction in degrees, stored as a character since ‘VRB’ means variable
  • WindSpeed (int) - the prevailing wind speed in knots
  • WindGust (dbl) - the wind gust speed in knots (NA if there is no recorded wind gust at that hour)
  • predomDir (fct) - the predominant wind direction, one of NE-E-SE-S-SW-W-NW-N-VRB-000-Error
  • Visibility (dbl) - surface visibility in statute miles
  • Altimeter (dbl) - altimeter in inches of mercury
  • TempF (dbl) - the Fahrenheit temperature
  • DewF (dbl) - the Fahrenheit dew point
  • modSLP (dbl) - Sea-Level Pressure (SLP) in hPa, adjusted to account for the METAR encoding: SLP is recorded on a 0-1000 scale but represents values in the 950-1050 hPa range
  • cTypen (chr) - the cloud type of the nth cloud layer (FEW, BKN, SCT, OVC, or VV)
  • cLeveln (dbl) - the cloud height in feet of the nth cloud layer
  • isRain (lgl) - was rain occurring at the moment the METAR was captured?
  • isSnow (lgl) - was snow occurring at the moment the METAR was captured?
  • isThunder (lgl) - was thunder occurring at the moment the METAR was captured?
  • p1Inches (dbl) - how many inches of rain occurred in the past hour?
  • p36Inches (dbl) - how many inches of rain occurred in the past 3/6 hours (3-hour summaries at 3Z-9Z-15Z-21Z and 6-hour summaries at 6Z-12Z-18Z-24Z and NA at any other Z times)?
  • p24Inches (dbl) - how many inches of rain occurred in the past 24 hours (at 12Z, NA at all other times)
  • tempFHi (dbl) - the high temperature in the past 24 hours, in Fahrenheit (reported once per day)
  • tempFLo (dbl) - the low temperature in the past 24 hours, in Fahrenheit (reported once per day)
  • minHeight (dbl) - the minimum cloud height in feet (-100 means ‘no clouds’)
  • minType (fct) - amount of obscuration at the minimum cloud height (VV > OVC > BKN > SCT > FEW > CLR)
  • ceilingHeight (dbl) - the minimum cloud ceiling in feet (-100 means ‘no ceiling’)
  • ceilingType (fct) - amount of obscuration at the minimum ceiling height (VV > OVC > BKN)
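The ceilingHeight convention above (lowest BKN/OVC/VV layer, with -100 meaning ‘no ceiling’) can be sketched as a small standalone function. deriveCeiling is a hypothetical helper written here for illustration; it is not part of the processed files:

```r
# Sketch: derive the cloud ceiling from parallel type/height vectors for one
# observation. A ceiling is the lowest BKN, OVC, or VV layer; -100 encodes
# "no ceiling", matching the convention described above.
deriveCeiling <- function(cTypes, cLevels) {
    isCeiling <- cTypes %in% c("BKN", "OVC", "VV") & !is.na(cLevels)
    if (!any(isCeiling)) return(-100)
    min(cLevels[isCeiling])
}

deriveCeiling(c("FEW", "BKN", "OVC"), c(1200, 2500, 4000))  # 2500
deriveCeiling(c("SCT", ""), c(3000, NA))                    # -100 (no ceiling)
```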

metar_modifiedClouds_20200617.rds

  • source (chr) - the reporting station and year
  • sourceName (chr) - the descriptive name for source
  • dtime (dttm) - the date-time for the observation
  • level (dbl) - cloud level (level 0 is inserted for every source-dtime as a base layer of clear)
  • height (dbl) - level height (height -100 is inserted for every source-dtime as a base layer of clear)
  • type (fct) - level type (type CLR is inserted for every source-dtime as a base layer of clear)

metar_precipLists_20200617.rds

  • Contains elements for each of rain/snow/thunder for each of 2015/2016/2017
  • Each element contains a list and a tibble
  • The tibble is precipLength and contains precipitation by month as source-locale-month-hours-events
  • The list is precipList and contains data on each precipitation interval

Glimpses of the three main files are as follows:

# The tidyverse library will be used throughout
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.4
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts ---------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
# Main weather data
metarData <- readRDS("./RInputFiles/ProcessedMETAR/metar_postEDA_20200617.rds")
glimpse(metarData)
## Observations: 201,216
## Variables: 41
## $ source        <chr> "kdtw_2016", "kdtw_2016", "kdtw_2016", "kdtw_2016", "...
## $ locale        <chr> "Detroit, MI (2016)", "Detroit, MI (2016)", "Detroit,...
## $ dtime         <dttm> 2016-01-01 00:53:00, 2016-01-01 01:53:00, 2016-01-01...
## $ origMETAR     <chr> "KDTW 010053Z 23012KT 10SM OVC020 00/M05 A3021 RMK AO...
## $ year          <dbl> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016,...
## $ monthint      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ month         <fct> Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan...
## $ day           <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ WindDir       <chr> "230", "230", "230", "240", "230", "220", "220", "220...
## $ WindSpeed     <int> 12, 12, 11, 14, 16, 13, 14, 16, 13, 16, 17, 13, 15, 1...
## $ WindGust      <dbl> NA, NA, NA, 23, 22, NA, 20, 20, NA, 22, NA, NA, 22, 2...
## $ predomDir     <fct> SW, SW, SW, SW, SW, SW, SW, SW, SW, SW, SW, SW, SW, S...
## $ Visibility    <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 8, 5, 7, 8, 10, 10, 7...
## $ Altimeter     <dbl> 30.21, 30.21, 30.19, 30.19, 30.18, 30.16, 30.14, 30.1...
## $ TempF         <dbl> 32.00, 32.00, 32.00, 30.92, 30.92, 32.00, 30.92, 30.9...
## $ DewF          <dbl> 23.00, 21.92, 21.02, 19.94, 19.94, 19.94, 19.94, 19.0...
## $ modSLP        <dbl> 1023.6, 1023.5, 1023.0, 1023.0, 1022.7, 1022.0, 1021....
## $ cType1        <chr> "OVC", "OVC", "OVC", "OVC", "OVC", "OVC", "OVC", "OVC...
## $ cType2        <chr> "", "", "", "", "", "", "", "", "", "OVC", "OVC", "OV...
## $ cType3        <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "...
## $ cType4        <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "...
## $ cType5        <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "...
## $ cType6        <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "...
## $ cLevel1       <dbl> 2000, 2000, 2200, 2500, 2700, 2500, 2500, 2500, 2800,...
## $ cLevel2       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 4500, 4000, 4000,...
## $ cLevel3       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ cLevel4       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ cLevel5       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ cLevel6       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ isRain        <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS...
## $ isSnow        <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS...
## $ isThunder     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS...
## $ p1Inches      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 0, 0, 0, 0, 0, NA, 0,...
## $ p36Inches     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 0, NA, NA, 0, NA, NA,...
## $ p24Inches     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ tempFHi       <dbl> NA, NA, NA, NA, 36, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ tempFLo       <dbl> NA, NA, NA, NA, 31, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ minHeight     <dbl> 2000, 2000, 2200, 2500, 2700, 2500, 2500, 2500, 2800,...
## $ minType       <fct> OVC, OVC, OVC, OVC, OVC, OVC, OVC, OVC, OVC, BKN, SCT...
## $ ceilingHeight <dbl> 2000, 2000, 2200, 2500, 2700, 2500, 2500, 2500, 2800,...
## $ ceilingType   <fct> OVC, OVC, OVC, OVC, OVC, OVC, OVC, OVC, OVC, BKN, OVC...
# Extra clouds data
cloudData <- readRDS("./RInputFiles/ProcessedMETAR/metar_modifiedClouds_20200617.rds")
glimpse(cloudData)
## Observations: 393,258
## Variables: 6
## $ source     <chr> "kdtw_2016", "kdtw_2016", "kdtw_2016", "kdtw_2016", "kdt...
## $ sourceName <chr> "Detroit, MI (2016)", "Detroit, MI (2016)", "Detroit, MI...
## $ dtime      <dttm> 2016-01-01 00:53:00, 2016-01-01 00:53:00, 2016-01-01 01...
## $ level      <dbl> 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0,...
## $ height     <dbl> -100, 2000, -100, 2000, -100, 2200, -100, 2500, -100, 27...
## $ type       <fct> CLR, OVC, CLR, OVC, CLR, OVC, CLR, OVC, CLR, OVC, CLR, O...
# Precipitation summaries
precipData <- readRDS("./RInputFiles/ProcessedMETAR/metar_precipLists_20200617.rds")
names(precipData)
## [1] "rain2015"    "rain2016"    "rain2017"    "snow2015"    "snow2016"   
## [6] "snow2017"    "thunder2015" "thunder2016" "thunder2017"
names(precipData$rain2016)
## [1] "precipList"   "precipLength"
names(precipData$rain2016$precipList)
##  [1] "kdtw_2016_RA" "kewr_2016_RA" "kgrb_2016_RA" "kgrr_2016_RA" "kiah_2016_RA"
##  [6] "kind_2016_RA" "klas_2016_RA" "klnk_2016_RA" "kmke_2016_RA" "kmsn_2016_RA"
## [11] "kmsp_2016_RA" "kmsy_2016_RA" "kord_2016_RA" "ksan_2016_RA" "ktvc_2016_RA"
names(precipData$rain2016$precipList$kdtw_2016_RA)
## [1] "pStateFrame"      "pIssueTimes"      "mismatches"       "mismatchTimes"   
## [5] "beginEndInterval"
glimpse(precipData$rain2016$precipLength)
## Observations: 180
## Variables: 5
## $ source <chr> "kdtw_2016", "kdtw_2016", "kdtw_2016", "kdtw_2016", "kdtw_20...
## $ locale <chr> "Detroit, MI (2016)", "Detroit, MI (2016)", "Detroit, MI (20...
## $ month  <fct> Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec, ...
## $ hours  <dbl> 27.928889, 20.828333, 77.274722, 73.072778, 46.488611, 20.12...
## $ events <dbl> 16, 18, 31, 38, 41, 31, 31, 40, 49, 29, 41, 21, 20, 41, 30, ...
glimpse(precipData$rain2016$precipList$kdtw_2016_RA)
## List of 5
##  $ pStateFrame     :Classes 'tbl_df', 'tbl' and 'data.frame':    8818 obs. of  5 variables:
##   ..$ origMETAR: chr [1:8818] "KDTW 310053Z 23007KT 10SM BKN025 OVC050 02/M03 A3017 RMK AO2 SLP223 T00221033" "KDTW 310153Z 22009KT 10SM OVC021 02/M03 A3019 RMK AO2 SLP229 T00221028" "KDTW 310253Z 24009KT 10SM OVC021 02/M03 A3018 RMK AO2 SLP224 T00171028 50004" "KDTW 310353Z 24010KT 10SM OVC025 02/M04 A3018 RMK AO2 SLP225 T00171039" ...
##   ..$ dtime    : POSIXct[1:8818], format: "2015-12-31 00:53:00" "2015-12-31 01:53:00" ...
##   ..$ strPrecip: chr [1:8818] NA NA NA NA ...
##   ..$ curPrecip: logi [1:8818] FALSE FALSE FALSE FALSE FALSE FALSE ...
##   ..$ lagPrecip: logi [1:8818] NA FALSE FALSE FALSE FALSE FALSE ...
##  $ pIssueTimes     :Classes 'tbl_df', 'tbl' and 'data.frame':    7 obs. of  1 variable:
##   ..$ dtime: POSIXct[1:7], format: "2016-02-24 11:53:00" "2016-04-01 00:53:00" ...
##  $ mismatches      : int(0) 
##  $ mismatchTimes   : 'POSIXct' num(0) 
##  - attr(*, "tzone")= chr "UTC"
##  $ beginEndInterval:Classes 'tbl_df', 'tbl' and 'data.frame':    387 obs. of  3 variables:
##   ..$ beginTimes: POSIXct[1:387], format: "2015-12-31 17:45:00" "2016-01-08 16:27:00" ...
##   ..$ endTime   : POSIXct[1:387], format: "2015-12-31 17:53:00" "2016-01-09 00:13:00" ...
##   ..$ precipInts:Formal class 'Interval' [package "lubridate"] with 3 slots

Helpers - Functions and Mappings

There are a few functions and mappings that can help with the modeling process:

# The process frequently uses tidyverse and lubridate
library(tidyverse)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
# The main path for the files
filePath <- "./RInputFiles/ProcessedMETAR/"


# Descriptive names for key variables
varMapper <- c(source="Original source file name", 
               locale="Descriptive name",
               dtime="Date-Time (UTC)",
               origMETAR="Original METAR",
               year="Year",
               monthint="Month",
               month="Month", 
               day="Day of Month",
               WindDir="Wind Direction (degrees)", 
               WindSpeed="Wind Speed (kts)",
               WindGust="Wind Gust (kts)",
               predomDir="General Prevailing Wind Direction",
               Visibility="Visibility (SM)", 
               Altimeter="Altimeter (inches Hg)",
               TempF="Temperature (F)",
               DewF="Dew Point (F)", 
               modSLP="Sea-Level Pressure (hPa)", 
               cType1="First Cloud Layer Type", 
               cLevel1="First Cloud Layer Height (ft)",
               isRain="Rain at Observation Time",
               isSnow="Snow at Observation Time",
               isThunder="Thunder at Observation Time",
               tempFHi="24-hour High Temperature (F)",
               tempFLo="24-hour Low Temperature (F)",
               minHeight="Minimum Cloud Height (ft)",
               minType="Obscuration Level at Minimum Cloud Height",
               ceilingHeight="Minimum Ceiling Height (ft)",
               ceilingType="Obscuration Level at Minimum Ceiling Height", 
               hr="Hour of Day (Zulu time)"
               )


# File name to city name mapper
cityNameMapper <- c(kdtw_2016="Detroit, MI (2016)", 
                    kewr_2016="Newark, NJ (2016)",
                    kgrb_2016="Green Bay, WI (2016)",
                    kgrr_2016="Grand Rapids, MI (2016)",
                    kiah_2016="Houston, TX (2016)",
                    kind_2016="Indianapolis, IN (2016)",
                    klas_2015="Las Vegas, NV (2015)",
                    klas_2016="Las Vegas, NV (2016)", 
                    klas_2017="Las Vegas, NV (2017)", 
                    klnk_2016="Lincoln, NE (2016)",
                    kmke_2016="Milwaukee, WI (2016)",
                    kmsn_2016="Madison, WI (2016)",
                    kmsp_2016="Minneapolis, MN (2016)",
                    kmsy_2015="New Orleans, LA (2015)",
                    kmsy_2016="New Orleans, LA (2016)", 
                    kmsy_2017="New Orleans, LA (2017)", 
                    kord_2015="Chicago, IL (2015)",
                    kord_2016="Chicago, IL (2016)", 
                    kord_2017="Chicago, IL (2017)", 
                    ksan_2015="San Diego, CA (2015)",
                    ksan_2016="San Diego, CA (2016)",
                    ksan_2017="San Diego, CA (2017)",
                    ktvc_2016="Traverse City, MI (2016)"
                    )
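These named vectors act as simple lookup tables: indexing by name returns the descriptive label, which can then feed plot labels rather than hard-coded strings. A two-entry excerpt of varMapper is redefined here so the snippet stands alone:

```r
# Two-entry excerpt of varMapper, for illustration only; the full vector is
# defined above. Named-vector lookup returns the descriptive label.
varMapperExcerpt <- c(TempF="Temperature (F)", DewF="Dew Point (F)")
varMapperExcerpt[["TempF"]]  # "Temperature (F)"

# In a ggplot call, labels could then be set with, e.g.:
# labs(x=varMapperExcerpt[["TempF"]], y=varMapperExcerpt[["DewF"]])
```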

Initial Exploration

The caret package provides a uniform interface for fitting many types of models to the data.

A function is written to subset the data and create appropriate train-test splits:

# Create test and train data, filtered for locales and months, and keeping only relevant variables
createTestTrain <- function(df, 
                            sources=NULL, 
                            vrbls=NULL,
                            convFactor="locale",
                            months=1:12,
                            testSize=0.3,
                            noNA=TRUE,
                            seed=NULL
                            ) {
    
    # FUNCTION ARGUMENTS:
    # df: the data frame or tibble containing the data
    # sources: the source records to be included (NULL will include all)
    # vrbls: the variables to be included (NULL will include all)
    # convFactor: variable to convert to a factor after processing (NULL for none)
    # months: the months to be included
    # testSize: the fraction of observations to be included as test
    # noNA: whether to keep only complete cases
    # seed: the seed for reproducibility

    # Set the seed if it has been passed
    if (!is.null(seed)) { set.seed(seed) }
    
    # Set sources to be all sources if NULL
    if (is.null(sources)) { 
        sources <- df %>%
            count(source) %>%
            pull(source)
    }
    
    # Set vrbls to be all variables if NULL
    if (is.null(vrbls)) { vrbls <- names(df) }
    
    # Filter the data for the relevant sources and months, and limit to vars
    dfMod <- df %>%
        filter(source %in% sources, monthint %in% months) %>%
        select_at(vars(all_of(vrbls)))
    
    # Use only complete cases if noNA=TRUE
    if (noNA) {
        dfMod <- dfMod %>%
            filter(complete.cases(dfMod))
    }
    
    # Convert the key variable to a factor if requested
    if (!is.null(convFactor)) {
        dfMod <- dfMod %>%
            mutate_at(vars(all_of(convFactor)), factor)
    }
    
    # Split into test and train (the sampled indices are left unsorted so records from a given locale do not end up in any particular order)
    idx <- sample(1:nrow(dfMod), round((1-testSize) * nrow(dfMod)), replace=FALSE)

    # Return the test and train data
    list(testData=dfMod[-idx, ], 
         trainData=dfMod[idx, ]
         )
    
}

As an example, data for 2016 are pulled for two of the cities - Las Vegas and New Orleans:

ttLists <- createTestTrain(metarData, 
                           sources=c("klas_2016", "kmsy_2016"), 
                           vrbls=c("locale", "month", "predomDir", "TempF", "DewF", "Altimeter", "modSLP"),
                           seed=2006181356
                           )
ttLists
## $testData
## # A tibble: 5,246 x 7
##    locale               month predomDir TempF  DewF Altimeter modSLP
##    <fct>                <fct> <fct>     <dbl> <dbl>     <dbl>  <dbl>
##  1 Las Vegas, NV (2016) Jan   NE         37.0  8.06      30.2  1024 
##  2 Las Vegas, NV (2016) Jan   NE         35.1  8.96      30.2  1024 
##  3 Las Vegas, NV (2016) Jan   S          30.0  8.96      30.2  1024.
##  4 Las Vegas, NV (2016) Jan   S          27.0  8.06      30.2  1025.
##  5 Las Vegas, NV (2016) Jan   S          28.9  8.06      30.2  1025.
##  6 Las Vegas, NV (2016) Jan   S          37.9 12.0       30.2  1023.
##  7 Las Vegas, NV (2016) Jan   000        39.0 12.9       30.2  1023.
##  8 Las Vegas, NV (2016) Jan   NE         42.1 12.9       30.2  1022.
##  9 Las Vegas, NV (2016) Jan   000        41   12.9       30.2  1022.
## 10 Las Vegas, NV (2016) Jan   N          51.1 18.0       30.2  1021.
## # ... with 5,236 more rows
## 
## $trainData
## # A tibble: 12,242 x 7
##    locale                 month predomDir TempF  DewF Altimeter modSLP
##    <fct>                  <fct> <fct>     <dbl> <dbl>     <dbl>  <dbl>
##  1 Las Vegas, NV (2016)   Jul   SW        102.   21.9      29.7  1004 
##  2 New Orleans, LA (2016) Aug   SW         78.1  73.9      29.9  1012.
##  3 New Orleans, LA (2016) May   E          81.0  72.0      29.9  1012.
##  4 Las Vegas, NV (2016)   Dec   000        61.0  36.0      30.0  1017.
##  5 New Orleans, LA (2016) Dec   VRB        62.1  57.0      30.1  1019.
##  6 New Orleans, LA (2016) Mar   S          72.0  59        30.0  1016 
##  7 Las Vegas, NV (2016)   Apr   S          75.0  12.0      29.8  1008.
##  8 Las Vegas, NV (2016)   Nov   S          62.1  23        29.9  1011.
##  9 Las Vegas, NV (2016)   Dec   S          37.0  25.0      30.4  1030.
## 10 New Orleans, LA (2016) Aug   E          93.9  75.0      30.0  1016.
## # ... with 12,232 more rows
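The split sizes are consistent with the default testSize of 0.3: the two cities contribute 17,488 complete-case rows, which divide into the 12,242 training and 5,246 test rows shown above:

```r
# Check: with testSize = 0.3 the train set gets round(0.7 * n) rows and the
# test set the remainder, matching the tibble sizes printed above
n <- 17488                        # klas_2016 + kmsy_2016 complete cases
trainN <- round((1 - 0.3) * n)
c(train=trainN, test=n - trainN)  # train 12242, test 5246
```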

A simple random forest can then be applied to the data, with several options for complexity:

# Create a tuning grid and run the models
trGrid <- expand.grid(min.node.size=c(1, 5, 10, 25, 100), 
                      mtry=c(1, 2, 3, 4, 5, 6), 
                      splitrule=c("gini")
                      )

caretModel <- caret::train(locale ~ ., 
                           data=ttLists$trainData,
                           method="ranger",
                           tuneGrid=trGrid,
                           trControl=caret::trainControl(method="cv", number=5),
                           num.trees=50
                           )
caretModel
## Random Forest 
## 
## 12242 samples
##     6 predictor
##     2 classes: 'Las Vegas, NV (2016)', 'New Orleans, LA (2016)' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 9793, 9793, 9795, 9793, 9794 
## Resampling results across tuning parameters:
## 
##   min.node.size  mtry  Accuracy   Kappa    
##     1            1     0.8427581  0.6855732
##     1            2     0.9449438  0.8898891
##     1            3     0.9732885  0.9465764
##     1            4     0.9825196  0.9650387
##     1            5     0.9842349  0.9684695
##     1            6     0.9843985  0.9687966
##     5            1     0.8451177  0.6903017
##     5            2     0.9421621  0.8843279
##     5            3     0.9742687  0.9485368
##     5            4     0.9810491  0.9620978
##     5            5     0.9830913  0.9661822
##     5            6     0.9844800  0.9689597
##    10            1     0.8492887  0.6986454
##    10            2     0.9402060  0.8804172
##    10            3     0.9716545  0.9433086
##    10            4     0.9803137  0.9606268
##    10            5     0.9829277  0.9658549
##    10            6     0.9839898  0.9679792
##    25            1     0.8367047  0.6734723
##    25            2     0.9436314  0.8872678
##    25            3     0.9703475  0.9406938
##    25            4     0.9774547  0.9549087
##    25            5     0.9802316  0.9604627
##    25            6     0.9812124  0.9624242
##   100            1     0.8462636  0.6925740
##   100            2     0.9366941  0.8733933
##   100            3     0.9594830  0.9189650
##   100            4     0.9684685  0.9369361
##   100            5     0.9698577  0.9397143
##   100            6     0.9718186  0.9436358
## 
## Tuning parameter 'splitrule' was held constant at a value of gini
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 6, splitrule = gini
##  and min.node.size = 5.

The tuning shows little concern for over-fitting, selecting a small minimum node size (5) and the largest mtry (6). How well does this model work on the test data?

# Run the best parameters from ranger in randomForest
caretBest <- randomForest::randomForest(locale ~ ., 
                                        data=ttLists$trainData,
                                        ntree=50, 
                                        nodesize=1, 
                                        mtry=6
                                        )
caretBest
## 
## Call:
##  randomForest(formula = locale ~ ., data = ttLists$trainData,      ntree = 50, nodesize = 1, mtry = 6) 
##                Type of random forest: classification
##                      Number of trees: 50
## No. of variables tried at each split: 6
## 
##         OOB estimate of  error rate: 1.47%
## Confusion matrix:
##                        Las Vegas, NV (2016) New Orleans, LA (2016) class.error
## Las Vegas, NV (2016)                   6034                     95  0.01550008
## New Orleans, LA (2016)                   85                   6028  0.01390479

The model estimates an OOB error rate of roughly 1.5%, suggesting a strong ability to differentiate Las Vegas from New Orleans.
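The reported OOB error rate follows directly from the confusion-matrix counts above:

```r
# OOB error = misclassified rows / total rows, from the confusion matrix above
misclassified <- 95 + 85
total <- 6034 + 95 + 85 + 6028
round(100 * misclassified / total, 2)  # 1.47, matching the reported 1.47%
```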

A function is built to assess performance on the test data:

# Predictions and confusion matrices on test data
evalPredictions <- function(model, 
                            testData, 
                            printAll=TRUE,
                            printCM=printAll, 
                            printConfSummary=printAll, 
                            printConfTable=printAll, 
                            showPlots=TRUE
                            ) {
    
    # FUNCTION ARGUMENTS:
    # model: a trained model
    # testData: the test data to apply the model against
    # printAll: whether to print the text summaries (by default, carries to the next three arguments)
    # printCM: whether to print the confusion matrix
    # printConfSummary: whether to print the voting summary (rough proxy for confidence) of the predictions
    # printConfTable: whether to summarize the accuracy by voting summary
    # showPlots: whether to display plots related to the outputs
    
    # Get the predicted class and probabilities
    testClass <- predict(model, newdata=testData)
    testProbs <- predict(model, newdata=testData, type="prob")
    if (printCM) {
        print(caret::confusionMatrix(testClass, testData$locale))
    }
    
    # Create a tibble containing class prediction, maximum probability, and individual predictions
    tblProbs <- tibble::as_tibble(testProbs) %>%
        mutate(maxProb=apply(., 1, FUN=max), 
               sumProb=apply(., 1, FUN=sum), 
               predClass=testClass, 
               locale=testData$locale, 
               accurate=(predClass==locale)
               )
    
    # Describe the maximum probability by source
    if (printConfSummary) {
        tblProbs %>%
            group_by(locale) %>%
            summarize(meanMax=mean(maxProb), medianMax=median(maxProb), 
                      pct90Plus=mean(maxProb > 0.9), pct50Minus=mean(maxProb < 0.5)
                      ) %>%
        print()
    }
    
    # Create a table of accuracy by source and prediction confidence
    p1Data <- tblProbs %>%
        mutate(predProb=0.5 * round(2*maxProb, 1)) %>%
        group_by(predProb, locale) %>%
        summarize(pctCorrect=mean(accurate), nCorrect=sum(accurate), nObs=n())
    
    p1Print <- p1Data %>%
        group_by(predProb) %>%
        summarize(nCorrect=sum(nCorrect), nObs=sum(nObs)) %>%
        mutate(pctCorrect=nCorrect/nObs)
    if (printConfTable) {
        print(p1Print)
    }
    
    cat("\nMean Error-Squared Between Confidence of Prediction and Accuracy of Prediction\n")
    p1Print %>%
        mutate(err2=nObs*(pctCorrect-predProb)**2) %>%
        summarize(meanError2=sum(err2)/sum(nObs)) %>%
        print()
    
    # Plot the maximum probability forecasted by row
    p1 <- p1Data %>%
        ggplot(aes(x=predProb)) +
        geom_col(aes(y=nObs, fill=locale)) + 
        labs(x="Maximum probability predicted", y="# Observations", 
             title="Count of Maximum Probability Predicted by Locale"
             )
    p2 <- p1Data %>%
        ggplot(aes(x=predProb)) +
        geom_line(aes(y=pctCorrect, group=locale, color=locale)) + 
        geom_abline(aes(intercept=0, slope=1), lty=2) +
        ylim(c(0, 1)) + 
        labs(x="Maximum probability predicted", y="Actual Probability Correct", 
             title="Accuracy of Maximum Probability Predicted by Locale"
             )
    
    if (showPlots) {
        print(p1)
        print(p2)
    }
    
    tblProbs
    
}

And the function is then run for the Las Vegas and New Orleans data:

evalPredictions(caretBest, testData=ttLists$testData)
## Confusion Matrix and Statistics
## 
##                         Reference
## Prediction               Las Vegas, NV (2016) New Orleans, LA (2016)
##   Las Vegas, NV (2016)                   2578                     27
##   New Orleans, LA (2016)                   28                   2613
##                                               
##                Accuracy : 0.9895              
##                  95% CI : (0.9864, 0.9921)    
##     No Information Rate : 0.5032              
##     P-Value [Acc > NIR] : <2e-16              
##                                               
##                   Kappa : 0.979               
##                                               
##  Mcnemar's Test P-Value : 1                   
##                                               
##             Sensitivity : 0.9893              
##             Specificity : 0.9898              
##          Pos Pred Value : 0.9896              
##          Neg Pred Value : 0.9894              
##              Prevalence : 0.4968              
##          Detection Rate : 0.4914              
##    Detection Prevalence : 0.4966              
##       Balanced Accuracy : 0.9895              
##                                               
##        'Positive' Class : Las Vegas, NV (2016)
##                                               
## # A tibble: 2 x 5
##   locale                 meanMax medianMax pct90Plus pct50Minus
##   <fct>                    <dbl>     <dbl>     <dbl>      <dbl>
## 1 Las Vegas, NV (2016)     0.978         1     0.923          0
## 2 New Orleans, LA (2016)   0.983         1     0.939          0
## # A tibble: 11 x 4
##    predProb nCorrect  nObs pctCorrect
##       <dbl>    <int> <int>      <dbl>
##  1     0.5        11    20      0.55 
##  2     0.55       10    21      0.476
##  3     0.6        14    20      0.7  
##  4     0.65       13    18      0.722
##  5     0.7        34    43      0.791
##  6     0.75       29    33      0.879
##  7     0.8        50    52      0.962
##  8     0.85       55    55      1    
##  9     0.9       152   157      0.968
## 10     0.95      193   196      0.985
## 11     1        4630  4631      1.00 
## 
## Mean Error-Squared Between Confidence of Prediction and Accuracy of Prediction
## # A tibble: 1 x 1
##   meanError2
##        <dbl>
## 1   0.000938

## # A tibble: 5,246 x 7
##    `Las Vegas, NV (~ `New Orleans, LA~ maxProb sumProb predClass locale accurate
##                <dbl>             <dbl>   <dbl>   <dbl> <fct>     <fct>  <lgl>   
##  1              0.98              0.02    0.98       1 Las Vega~ Las V~ TRUE    
##  2              1                 0       1          1 Las Vega~ Las V~ TRUE    
##  3              1                 0       1          1 Las Vega~ Las V~ TRUE    
##  4              0.98              0.02    0.98       1 Las Vega~ Las V~ TRUE    
##  5              1                 0       1          1 Las Vega~ Las V~ TRUE    
##  6              1                 0       1          1 Las Vega~ Las V~ TRUE    
##  7              1                 0       1          1 Las Vega~ Las V~ TRUE    
##  8              1                 0       1          1 Las Vega~ Las V~ TRUE    
##  9              1                 0       1          1 Las Vega~ Las V~ TRUE    
## 10              1                 0       1          1 Las Vega~ Las V~ TRUE    
## # ... with 5,236 more rows
caret::varImp(caretBest) %>%
    rownames_to_column() %>%
    ggplot(aes(x=fct_reorder(rowname, -Overall), y=Overall)) + 
    geom_col() + 
    labs(x="", y="Importance")
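The meanError2 value can be recomputed by hand from the calibration table printed above, which makes the weighting explicit: each bin contributes its squared gap between predicted confidence and realized accuracy, weighted by its number of observations:

```r
# Recompute meanError2 from the calibration table above: squared gap between
# binned prediction confidence and realized accuracy, weighted by bin size
predProb <- c(0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1)
nCorrect <- c(11, 10, 14, 13, 34, 29, 50, 55, 152, 193, 4630)
nObs     <- c(20, 21, 20, 18, 43, 33, 52, 55, 157, 196, 4631)
pctCorrect <- nCorrect / nObs
meanError2 <- sum(nObs * (pctCorrect - predProb)^2) / sum(nObs)
round(meanError2, 6)  # 0.000938, matching the printed value
```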

The model has greater than 98% accuracy in separating Las Vegas from New Orleans, driven by 1) New Orleans being extremely humid while Las Vegas is a desert, and 2) New Orleans sitting at sea level while Las Vegas sits at high altitude.
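The humidity point is visible even in the handful of rows printed earlier. Using just the DewF values from the ten training rows shown above (copied here so the snippet stands alone):

```r
# DewF values copied from the ten trainData rows printed above; even this tiny
# sample shows the desert vs. Gulf-coast dew point gap driving the split
dewLV <- c(21.9, 36.0, 12.0, 23, 25.0)   # Las Vegas rows
dewNO <- c(73.9, 72.0, 57.0, 59, 75.0)   # New Orleans rows
c(LasVegas=median(dewLV), NewOrleans=median(dewNO))  # 23 vs 72
```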

Using modSLP and Altimeter is arguably cheating, since their values and the relationship between them are largely determined by a location’s height above sea level. How does the model perform if they are removed?

# Create a tuning grid and run the models
trGrid <- expand.grid(min.node.size=c(1, 5, 10, 25, 100), 
                      mtry=c(1, 2, 3, 4), 
                      splitrule=c("gini")
                      )

caretModel <- caret::train(locale ~ DewF + month + predomDir + TempF, 
                           data=ttLists$trainData,
                           method="ranger",
                           tuneGrid=trGrid,
                           trControl=caret::trainControl(method="cv", number=5),
                           num.trees=50
                           )
caretModel
## Random Forest 
## 
## 12242 samples
##     4 predictor
##     2 classes: 'Las Vegas, NV (2016)', 'New Orleans, LA (2016)' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 9795, 9793, 9793, 9794, 9793 
## Resampling results across tuning parameters:
## 
##   min.node.size  mtry  Accuracy   Kappa    
##     1            1     0.7760118  0.5520364
##     1            2     0.9124317  0.8248627
##     1            3     0.9327730  0.8655459
##     1            4     0.9500081  0.9000133
##     5            1     0.7602487  0.5204263
##     5            2     0.8996882  0.7993784
##     5            3     0.9323644  0.8647286
##     5            4     0.9480482  0.8960938
##    10            1     0.7947936  0.5895973
##    10            2     0.9154553  0.8309076
##    10            3     0.9313021  0.8626014
##    10            4     0.9465769  0.8931516
##    25            1     0.8033830  0.6068192
##    25            2     0.8972383  0.7944852
##    25            3     0.9300767  0.8601515
##    25            4     0.9462493  0.8924960
##   100            1     0.8175975  0.6352329
##   100            2     0.8979737  0.7959561
##   100            3     0.9250123  0.8500224
##   100            4     0.9345694  0.8691359
## 
## Tuning parameter 'splitrule' was held constant at a value of gini
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 4, splitrule = gini
##  and min.node.size = 1.
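caret's default selection rule ("best") simply keeps the row of the resampling table with the largest Accuracy. The pick above can be reproduced by hand from a few rows transcribed from the printout (the full grid lives in caretModel$results):

```r
# Rows transcribed from the resampling table above
results <- data.frame(min.node.size = c(1, 1, 5, 100),
                      mtry          = c(3, 4, 4, 4),
                      Accuracy      = c(0.9327730, 0.9500081, 0.9480482, 0.9345694))

# Keep the row with the largest resampled Accuracy
best <- results[which.max(results$Accuracy), ]
best  # mtry = 4, min.node.size = 1
```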

The model can then be run using the best parameters:

# Run the best parameters from ranger in randomForest
caretBest <- randomForest::randomForest(locale ~ DewF + month + predomDir + TempF, 
                                        data=ttLists$trainData,
                                        ntree=50, 
                                        nodesize=1, 
                                        mtry=4
                                        )
caretBest
## 
## Call:
##  randomForest(formula = locale ~ DewF + month + predomDir + TempF,      data = ttLists$trainData, ntree = 50, nodesize = 1, mtry = 4) 
##                Type of random forest: classification
##                      Number of trees: 50
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 3.65%
## Confusion matrix:
##                        Las Vegas, NV (2016) New Orleans, LA (2016) class.error
## Las Vegas, NV (2016)                   5912                    217  0.03540545
## New Orleans, LA (2016)                  230                   5883  0.03762473
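The printed OOB figures are simple shares of the confusion matrix: the overall error is the off-diagonal fraction, and each class error is that row's off-diagonal fraction. Checking the arithmetic in base R:

```r
# OOB confusion matrix transcribed from the printout (rows = true locale)
conf <- matrix(c(5912, 217,
                 230, 5883),
               nrow = 2, byrow = TRUE)

oobError <- 1 - sum(diag(conf)) / sum(conf)
round(100 * oobError, 2)               # 3.65, the printed OOB estimate

round(conf[1, 2] / sum(conf[1, ]), 8)  # 0.03540545, the Las Vegas class error
```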

As expected, performance dips, though only to roughly 96% accuracy on the test data:

evalPredictions(caretBest, testData=ttLists$testData)
## Confusion Matrix and Statistics
## 
##                         Reference
## Prediction               Las Vegas, NV (2016) New Orleans, LA (2016)
##   Las Vegas, NV (2016)                   2514                     89
##   New Orleans, LA (2016)                   92                   2551
##                                               
##                Accuracy : 0.9655              
##                  95% CI : (0.9602, 0.9703)    
##     No Information Rate : 0.5032              
##     P-Value [Acc > NIR] : <2e-16              
##                                               
##                   Kappa : 0.931               
##                                               
##  Mcnemar's Test P-Value : 0.8818              
##                                               
##             Sensitivity : 0.9647              
##             Specificity : 0.9663              
##          Pos Pred Value : 0.9658              
##          Neg Pred Value : 0.9652              
##              Prevalence : 0.4968              
##          Detection Rate : 0.4792              
##    Detection Prevalence : 0.4962              
##       Balanced Accuracy : 0.9655              
##                                               
##        'Positive' Class : Las Vegas, NV (2016)
##                                               
## # A tibble: 2 x 5
##   locale                 meanMax medianMax pct90Plus pct50Minus
##   <fct>                    <dbl>     <dbl>     <dbl>      <dbl>
## 1 Las Vegas, NV (2016)     0.962         1     0.868          0
## 2 New Orleans, LA (2016)   0.971         1     0.900          0
## # A tibble: 11 x 4
##    predProb nCorrect  nObs pctCorrect
##       <dbl>    <int> <int>      <dbl>
##  1     0.5         8    27      0.296
##  2     0.55       12    22      0.545
##  3     0.6        37    64      0.578
##  4     0.65       38    59      0.644
##  5     0.7        62    79      0.785
##  6     0.75       31    50      0.62 
##  7     0.8        85   104      0.817
##  8     0.85       80    93      0.860
##  9     0.9       182   194      0.938
## 10     0.95      221   227      0.974
## 11     1        4309  4327      0.996
## 
## Mean Error-Squared Between Confidence of Prediction and Accuracy of Prediction
## # A tibble: 1 x 1
##   meanError2
##        <dbl>
## 1   0.000589
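The calibration statistic appears to be the observation-weighted mean squared gap between each bin's confidence and its realized accuracy; recomputing it from the table above (counts transcribed from the printout) reproduces the printed value:

```r
# Calibration bins transcribed from the table above
predProb <- seq(0.5, 1, by = 0.05)
nCorrect <- c(8, 12, 37, 38, 62, 31, 85, 80, 182, 221, 4309)
nObs     <- c(27, 22, 64, 59, 79, 50, 104, 93, 194, 227, 4327)

# Observation-weighted mean squared gap between confidence and accuracy
meanError2 <- sum(nObs * (predProb - nCorrect / nObs)^2) / sum(nObs)
round(meanError2, 6)  # 0.000589
```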

## # A tibble: 5,246 x 7
##    `Las Vegas, NV (~ `New Orleans, LA~ maxProb sumProb predClass locale accurate
##                <dbl>             <dbl>   <dbl>   <dbl> <fct>     <fct>  <lgl>   
##  1              0.96              0.04    0.96       1 Las Vega~ Las V~ TRUE    
##  2              0.96              0.04    0.96       1 Las Vega~ Las V~ TRUE    
##  3              1                 0       1          1 Las Vega~ Las V~ TRUE    
##  4              1                 0       1          1 Las Vega~ Las V~ TRUE    
##  5              1                 0       1          1 Las Vega~ Las V~ TRUE    
##  6              1                 0       1          1 Las Vega~ Las V~ TRUE    
##  7              1                 0       1          1 Las Vega~ Las V~ TRUE    
##  8              1                 0       1          1 Las Vega~ Las V~ TRUE    
##  9              1                 0       1          1 Las Vega~ Las V~ TRUE    
## 10              1                 0       1          1 Las Vega~ Las V~ TRUE    
## # ... with 5,236 more rows
caret::varImp(caretBest) %>%
    rownames_to_column() %>%
    ggplot(aes(x=fct_reorder(rowname, -Overall), y=Overall)) + 
    geom_col() + 
    labs(x="", y="Importance")

The very large difference in dew point drives the high classification ability:

ttLists$trainData %>% 
    bind_rows(ttLists$testData) %>%
    ggplot(aes(x=TempF, y=DewF)) + 
    geom_bin2d() + 
    facet_wrap(~locale) + 
    geom_density_2d() + 
    scale_fill_continuous(low="white", high="black")

Suppose on the other hand that the model tries to distinguish New Orleans from Houston:

ttLists <- createTestTrain(metarData, 
                           sources=c("kiah_2016", "kmsy_2016"), 
                           vrbls=c("locale", "month", "predomDir", "TempF", "DewF"),
                           seed=2006181440
                           )
ttLists
## $testData
## # A tibble: 5,248 x 5
##    locale             month predomDir TempF  DewF
##    <fct>              <fct> <fct>     <dbl> <dbl>
##  1 Houston, TX (2016) Jan   N          53.1  39.9
##  2 Houston, TX (2016) Jan   N          52.0  39.9
##  3 Houston, TX (2016) Jan   N          48.0  42.1
##  4 Houston, TX (2016) Jan   N          48.0  35.1
##  5 Houston, TX (2016) Jan   N          48.0  35.1
##  6 Houston, TX (2016) Jan   N          48.9  35.1
##  7 Houston, TX (2016) Jan   N          48.0  35.1
##  8 Houston, TX (2016) Jan   N          48.0  37.0
##  9 Houston, TX (2016) Jan   N          46.9  39.0
## 10 Houston, TX (2016) Jan   N          45.0  37.0
## # ... with 5,238 more rows
## 
## $trainData
## # A tibble: 12,245 x 5
##    locale                 month predomDir TempF  DewF
##    <fct>                  <fct> <fct>     <dbl> <dbl>
##  1 New Orleans, LA (2016) Dec   N          53.1  37.9
##  2 Houston, TX (2016)     Oct   S          75.9  71.1
##  3 Houston, TX (2016)     Oct   NE         73.0  44.1
##  4 New Orleans, LA (2016) May   N          72.0  55.9
##  5 New Orleans, LA (2016) Aug   S          78.1  72.0
##  6 Houston, TX (2016)     Jun   000        78.1  75.0
##  7 Houston, TX (2016)     Mar   000        54.0  53.1
##  8 Houston, TX (2016)     Oct   N          81.0  46.0
##  9 New Orleans, LA (2016) Dec   SW         80.1  64.9
## 10 Houston, TX (2016)     Sep   S          84.0  78.1
## # ... with 12,235 more rows
# Create a tuning grid and run the models
trGrid <- expand.grid(min.node.size=c(1, 5, 10, 25, 100), 
                      mtry=c(1, 2, 3, 4), 
                      splitrule=c("gini")
                      )

caretModel <- caret::train(locale ~ DewF + month + predomDir + TempF, 
                           data=ttLists$trainData,
                           method="ranger",
                           tuneGrid=trGrid,
                           trControl=caret::trainControl(method="cv", number=5),
                           num.trees=50
                           )
caretModel
## Random Forest 
## 
## 12245 samples
##     4 predictor
##     2 classes: 'Houston, TX (2016)', 'New Orleans, LA (2016)' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 9796, 9796, 9796, 9796, 9796 
## Resampling results across tuning parameters:
## 
##   min.node.size  mtry  Accuracy   Kappa    
##     1            1     0.5741119  0.1467460
##     1            2     0.6125766  0.2248296
##     1            3     0.6340547  0.2680327
##     1            4     0.6553695  0.3108463
##     5            1     0.5731319  0.1444556
##     5            2     0.6140465  0.2278396
##     5            3     0.6317681  0.2635971
##     5            4     0.6561045  0.3122708
##    10            1     0.5743569  0.1471436
##    10            2     0.6134749  0.2265743
##    10            3     0.6307064  0.2614035
##    10            4     0.6551245  0.3103053
##    25            1     0.5802368  0.1595270
##    25            2     0.6185382  0.2367579
##    25            3     0.6270314  0.2539625
##    25            4     0.6541445  0.3082967
##   100            1     0.5728052  0.1440925
##   100            2     0.6066966  0.2129075
##   100            3     0.6273581  0.2546343
##   100            4     0.6435280  0.2871188
## 
## Tuning parameter 'splitrule' was held constant at a value of gini
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 4, splitrule = gini
##  and min.node.size = 5.

The best parameters can then be re-run in randomForest:

# Run the best parameters from ranger in randomForest
caretBest <- randomForest::randomForest(locale ~ DewF + month + predomDir + TempF, 
                                        data=ttLists$trainData,
                                        ntree=50, 
                                        nodesize=5, 
                                        mtry=4
                                        )
caretBest
## 
## Call:
##  randomForest(formula = locale ~ DewF + month + predomDir + TempF,      data = ttLists$trainData, ntree = 50, nodesize = 5, mtry = 4) 
##                Type of random forest: classification
##                      Number of trees: 50
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 27.48%
## Confusion matrix:
##                        Houston, TX (2016) New Orleans, LA (2016) class.error
## Houston, TX (2016)                   4269                   1880   0.3057408
## New Orleans, LA (2016)               1485                   4611   0.2436024

Accuracy dips to roughly 74%, though this is still surprisingly high given how similar the climates of New Orleans and Houston are:

evalPredictions(caretBest, testData=ttLists$testData)
## Confusion Matrix and Statistics
## 
##                         Reference
## Prediction               Houston, TX (2016) New Orleans, LA (2016)
##   Houston, TX (2016)                   1789                    587
##   New Orleans, LA (2016)                802                   2070
##                                             
##                Accuracy : 0.7353            
##                  95% CI : (0.7232, 0.7472)  
##     No Information Rate : 0.5063            
##     P-Value [Acc > NIR] : < 2.2e-16         
##                                             
##                   Kappa : 0.47              
##                                             
##  Mcnemar's Test P-Value : 9.357e-09         
##                                             
##             Sensitivity : 0.6905            
##             Specificity : 0.7791            
##          Pos Pred Value : 0.7529            
##          Neg Pred Value : 0.7208            
##              Prevalence : 0.4937            
##          Detection Rate : 0.3409            
##    Detection Prevalence : 0.4527            
##       Balanced Accuracy : 0.7348            
##                                             
##        'Positive' Class : Houston, TX (2016)
##                                             
## # A tibble: 2 x 5
##   locale                 meanMax medianMax pct90Plus pct50Minus
##   <fct>                    <dbl>     <dbl>     <dbl>      <dbl>
## 1 Houston, TX (2016)       0.784      0.78     0.293          0
## 2 New Orleans, LA (2016)   0.785      0.8      0.275          0
## # A tibble: 11 x 4
##    predProb nCorrect  nObs pctCorrect
##       <dbl>    <int> <int>      <dbl>
##  1     0.5       119   229      0.520
##  2     0.55      198   352      0.562
##  3     0.6       308   535      0.576
##  4     0.65      228   370      0.616
##  5     0.7       335   504      0.665
##  6     0.75      297   409      0.726
##  7     0.8       432   582      0.742
##  8     0.85      324   404      0.802
##  9     0.9       457   564      0.810
## 10     0.95      431   499      0.864
## 11     1         730   800      0.912
## 
## Mean Error-Squared Between Confidence of Prediction and Accuracy of Prediction
## # A tibble: 1 x 1
##   meanError2
##        <dbl>
## 1    0.00362

## # A tibble: 5,248 x 7
##    `Houston, TX (20~ `New Orleans, LA~ maxProb sumProb predClass locale accurate
##                <dbl>             <dbl>   <dbl>   <dbl> <fct>     <fct>  <lgl>   
##  1              0.4               0.6     0.6        1 New Orle~ Houst~ FALSE   
##  2              0.6               0.4     0.6        1 Houston,~ Houst~ TRUE    
##  3              0.2               0.8     0.8        1 New Orle~ Houst~ FALSE   
##  4              0                 1       1          1 New Orle~ Houst~ FALSE   
##  5              0                 1       1          1 New Orle~ Houst~ FALSE   
##  6              0.06              0.94    0.94       1 New Orle~ Houst~ FALSE   
##  7              0                 1       1          1 New Orle~ Houst~ FALSE   
##  8              0                 1       1          1 New Orle~ Houst~ FALSE   
##  9              0.64              0.36    0.64       1 Houston,~ Houst~ TRUE    
## 10              1                 0       1          1 Houston,~ Houst~ TRUE    
## # ... with 5,238 more rows
caret::varImp(caretBest) %>%
    rownames_to_column() %>%
    ggplot(aes(x=fct_reorder(rowname, -Overall), y=Overall)) + 
    geom_col() + 
    labs(x="", y="Importance")

The mix of temperature and dew point is the primary driver of the classifications. There are many more “low confidence” (close-vote) predictions for these two cities:

ttLists$trainData %>% 
    bind_rows(ttLists$testData) %>%
    ggplot(aes(x=TempF, y=DewF)) + 
    geom_bin2d() + 
    facet_wrap(~locale) + 
    geom_density_2d() + 
    scale_fill_continuous(low="white", high="black")

It is impressive that the model can tease out distinctions in data that are, at a glance, very similar.

Temperature and Dew Point

Suppose that the only information available about two cities were their temperatures and dew points. How well would a basic random forest, with mtry=2, classify the cities?

A function is written to take two locales and a random seed, create relevant test-train data, run a random forest model, make predictions, assess the accuracy, and return the relevant objects:

# The createTestTrain function is updated to purely split an input dataframe
createTestTrain <- function(df, 
                            testSize=0.3, 
                            sortTrain=FALSE,
                            noNA=TRUE,
                            seed=NULL
                            ) {
    
    # FUNCTION ARGUMENTS:
    # df: the data frame or tibble for analysis
    # testSize: the proportion of the data to be used as test
    # sortTrain: boolean, whether to sort training indices to maintain the original order of training data
    # noNA: boolean, whether to include only complete cases
    # seed: the random seed to be used (NULL means no seed)
    
    # Filter out NA if requested
    if (noNA) {
        df <- df %>%
            filter(complete.cases(df))
    }
    
    # Get the desired number of train objects
    nTrain <- round((1-testSize) * nrow(df))
    
    # Set the random seed if it has been passed
    if (!is.null(seed)) { set.seed(seed) }
    
    # Get the indices for the training data
    idxTrain <- sample(1:nrow(df), size=nTrain, replace=FALSE)
    
    # Sort if requested
    if (sortTrain) { idxTrain <- sort(idxTrain) }
    
    # Return a list containing the train and test data
    list(trainData=df[idxTrain, ], 
         testData=df[-idxTrain, ]
         )
    
}
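At its core the helper is one call to sample() without replacement; a self-contained base-R sketch of the same 70/30 split on toy data (the seed value is arbitrary):

```r
set.seed(42)                             # arbitrary seed, as via the seed argument
df <- data.frame(x = 1:100)

nTrain   <- round((1 - 0.3) * nrow(df))  # testSize = 0.3
idxTrain <- sample(1:nrow(df), size = nTrain, replace = FALSE)

trainData <- df[idxTrain, , drop = FALSE]
testData  <- df[-idxTrain, , drop = FALSE]
c(train = nrow(trainData), test = nrow(testData))  # 70 and 30, with no overlap
```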


# Run a random forest for two locales
rfTwoLocales <- function(df,
                         loc1, 
                         loc2,
                         locVar="source",
                         otherVar="dtime",
                         vrbls=c("TempF", "DewF"),
                         pred="locale",
                         seed=NULL,
                         ntree=100,
                         mtry=NULL, 
                         testSize=0.3
                         ) {

    # FUNCTION ARGUMENTS:
    # df: the data frame or tibble
    # loc1: the first locale
    # loc2: the second locale
    # locVar: the name of the variable where loc1 and loc2 can be found
    # otherVar: other variables to be kept, but not used in modeling
    # vrbls: explanatory variables for modeling
    # pred: predictor variable for modeling
    # seed: the random seed (NULL means no seed)
    # ntree: the number of trees to grow in the random forest
    # mtry: the splitting parameter for the random forest (NULL means use all variables)
    # testSize: proportion of data to use as test dataset
    
    # Filter df such that it includes only observations in loc1 or loc2
    # Select only the predictor and variables of interest
    dfModel <- df %>%
        filter(get(locVar) %in% c(loc1, loc2)) %>%
        select_at(vars(all_of(c(pred, otherVar, vrbls))))
    
    # Create the test-train split (will randomize using seed if provided)
    ttLists <- createTestTrain(dfModel, testSize=testSize, seed=seed)

    # Set the seed if requested
    if (!is.null(seed)) { set.seed(seed) }
    
    # Set mtry to be the length of the variables if not provided
    if (is.null(mtry)) { mtry <- length(vrbls) }
    
    # Run the random forest on the training data
    rfModel <- randomForest::randomForest(x=ttLists$trainData[, vrbls], 
                                          y=factor(ttLists$trainData[, pred, drop=TRUE]), 
                                          mtry=mtry, 
                                          ntree=ntree
                                          )
    
    # Create predictions on the test data
    testPred <- predict(rfModel, ttLists$testData)
    
    # Augment the test data with the predictions
    testData <- ttLists$testData %>%
        mutate(predicted=testPred, correct=(testPred==get(pred)))
    
    # Return the model, augmented test data, and averaged OOB error rates as a list
    list(rfModel=rfModel, 
         testData=testData,
         errorRate=colMeans(rfModel$err.rate)
         )
    
}

The function is then run for every pairwise combination of 2016 locales in cityNameMapper (a common random seed can be passed to each run via the seed argument for reproducibility):

# Create the file names to explore
names_2016 <- grep(x=names(cityNameMapper), pattern="_2016", value=TRUE)

# Create a container list to hold the output
list_2016_TempF_DewF <- vector("list", 0.5*length(names_2016)*(length(names_2016)-1))

n <- 1
for (ctr in 1:(length(names_2016)-1)) {
    for (ctr2 in (ctr+1):length(names_2016)) {
        list_2016_TempF_DewF[[n]] <- rfTwoLocales(metarData, 
                                                  loc1=names_2016[ctr], 
                                                  loc2=names_2016[ctr2], 
                                                  vrbls=c("TempF", "DewF"),
                                                  ntree=25
                                                  )
        n <- n + 1
    }
}
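With 15 locales in names_2016, the nested loop fits one model per unordered pair, so the preallocated length 0.5*n*(n-1) is exactly choose(n, 2):

```r
n <- 15                                      # locales with 2016 data
stopifnot(0.5 * n * (n - 1) == choose(n, 2))
choose(n, 2)                                 # 105 pairwise models
```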

Statistics about the overall accuracy can then be captured:

# Helper function for map_dfr
helperAccuracyLocale <- function(x) {
    
    y <- x$errorRate
    tibble::tibble(locale1=names(y)[2], 
                   locale2=names(y)[3], 
                   accOverall=1-y[1], 
                   accLocale1=1-y[2], 
                   accLocale2=1-y[3]
                   )
    
}
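The positional unpacking in the helper leans on the layout of randomForest's err.rate matrix: the first column is the overall OOB error and the remaining columns are per-class errors, named for the classes in factor-level order. A toy stand-in for the averaged vector (numbers hypothetical):

```r
# Hypothetical result of colMeans(rfModel$err.rate) for one pairing
y <- c("OOB"                  = 0.103,
       "Green Bay, WI (2016)" = 0.107,
       "Las Vegas, NV (2016)" = 0.099)

data.frame(locale1    = names(y)[2],
           locale2    = names(y)[3],
           accOverall = 1 - y[[1]],
           accLocale1 = 1 - y[[2]],
           accLocale2 = 1 - y[[3]])
```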

# Create a tibble from the underlying accuracy data
acc_TempF_DewF <- map_dfr(list_2016_TempF_DewF, .f=helperAccuracyLocale)

# Assess the top 10 accuracy and the bottom 10 accuracy
acc_TempF_DewF %>%
    arrange(-accOverall) %>%
    head(10)
## # A tibble: 10 x 5
##    locale1                locale2               accOverall accLocale1 accLocale2
##    <chr>                  <chr>                      <dbl>      <dbl>      <dbl>
##  1 Green Bay, WI (2016)   Las Vegas, NV (2016)       0.897      0.893      0.901
##  2 Las Vegas, NV (2016)   New Orleans, LA (201~      0.890      0.903      0.877
##  3 Las Vegas, NV (2016)   Milwaukee, WI (2016)       0.888      0.887      0.889
##  4 Houston, TX (2016)     Las Vegas, NV (2016)       0.884      0.871      0.897
##  5 Las Vegas, NV (2016)   Madison, WI (2016)         0.879      0.892      0.866
##  6 Indianapolis, IN (201~ Las Vegas, NV (2016)       0.875      0.893      0.859
##  7 Chicago, IL (2016)     Las Vegas, NV (2016)       0.874      0.893      0.854
##  8 Grand Rapids, MI (201~ Las Vegas, NV (2016)       0.872      0.888      0.857
##  9 Las Vegas, NV (2016)   San Diego, CA (2016)       0.867      0.872      0.863
## 10 Las Vegas, NV (2016)   Traverse City, MI (2~      0.865      0.863      0.868
acc_TempF_DewF %>%
    arrange(accOverall) %>%
    head(10)
## # A tibble: 10 x 5
##    locale1                locale2               accOverall accLocale1 accLocale2
##    <chr>                  <chr>                      <dbl>      <dbl>      <dbl>
##  1 Chicago, IL (2016)     Milwaukee, WI (2016)       0.539      0.574      0.505
##  2 Chicago, IL (2016)     Madison, WI (2016)         0.548      0.580      0.515
##  3 Detroit, MI (2016)     Grand Rapids, MI (20~      0.550      0.491      0.610
##  4 Detroit, MI (2016)     Lincoln, NE (2016)         0.551      0.589      0.513
##  5 Chicago, IL (2016)     Detroit, MI (2016)         0.552      0.522      0.581
##  6 Grand Rapids, MI (201~ Traverse City, MI (2~      0.553      0.595      0.511
##  7 Indianapolis, IN (201~ Lincoln, NE (2016)         0.554      0.595      0.513
##  8 Detroit, MI (2016)     Traverse City, MI (2~      0.558      0.496      0.619
##  9 Detroit, MI (2016)     Indianapolis, IN (20~      0.559      0.587      0.531
## 10 Green Bay, WI (2016)   Madison, WI (2016)         0.560      0.528      0.591

Not surprisingly, accuracy is highest when comparing cold-weather or hot-humid cities to Las Vegas, and lowest when comparing cold-weather cities against each other. Interestingly, Chicago-Milwaukee and Grand Rapids-Traverse City again emerge as the two “closest” city pairs in this analysis.

Of further interest is an overall accuracy by city:

allAccuracy <- select(acc_TempF_DewF, locale=locale1, other=locale2, accOverall, accLocale=accLocale1) %>%
    bind_rows(select(acc_TempF_DewF, locale=locale2, other=locale1, accOverall, accLocale=accLocale2))

# Find the best match by locale
allAccuracy %>%
    group_by(locale) %>%
    top_n(accOverall, n=1) %>%
    arrange(-accOverall)
## # A tibble: 15 x 4
## # Groups:   locale [15]
##    locale                   other                accOverall accLocale
##    <chr>                    <chr>                     <dbl>     <dbl>
##  1 Green Bay, WI (2016)     Las Vegas, NV (2016)      0.897     0.893
##  2 Las Vegas, NV (2016)     Green Bay, WI (2016)      0.897     0.901
##  3 New Orleans, LA (2016)   Las Vegas, NV (2016)      0.890     0.877
##  4 Milwaukee, WI (2016)     Las Vegas, NV (2016)      0.888     0.889
##  5 Houston, TX (2016)       Las Vegas, NV (2016)      0.884     0.871
##  6 Madison, WI (2016)       Las Vegas, NV (2016)      0.879     0.866
##  7 Indianapolis, IN (2016)  Las Vegas, NV (2016)      0.875     0.893
##  8 Chicago, IL (2016)       Las Vegas, NV (2016)      0.874     0.893
##  9 Grand Rapids, MI (2016)  Las Vegas, NV (2016)      0.872     0.888
## 10 San Diego, CA (2016)     Las Vegas, NV (2016)      0.867     0.863
## 11 Traverse City, MI (2016) Las Vegas, NV (2016)      0.865     0.868
## 12 Detroit, MI (2016)       Las Vegas, NV (2016)      0.864     0.885
## 13 Minneapolis, MN (2016)   Las Vegas, NV (2016)      0.856     0.856
## 14 Lincoln, NE (2016)       Las Vegas, NV (2016)      0.845     0.833
## 15 Newark, NJ (2016)        Las Vegas, NV (2016)      0.809     0.812
# Find the worst match by locale
allAccuracy %>%
    group_by(locale) %>%
    top_n(-accOverall, n=1) %>%
    arrange(accOverall)
## # A tibble: 15 x 4
## # Groups:   locale [15]
##    locale                   other                   accOverall accLocale
##    <chr>                    <chr>                        <dbl>     <dbl>
##  1 Chicago, IL (2016)       Milwaukee, WI (2016)         0.539     0.574
##  2 Milwaukee, WI (2016)     Chicago, IL (2016)           0.539     0.505
##  3 Madison, WI (2016)       Chicago, IL (2016)           0.548     0.515
##  4 Detroit, MI (2016)       Grand Rapids, MI (2016)      0.550     0.491
##  5 Grand Rapids, MI (2016)  Detroit, MI (2016)           0.550     0.610
##  6 Lincoln, NE (2016)       Detroit, MI (2016)           0.551     0.513
##  7 Traverse City, MI (2016) Grand Rapids, MI (2016)      0.553     0.511
##  8 Indianapolis, IN (2016)  Lincoln, NE (2016)           0.554     0.595
##  9 Green Bay, WI (2016)     Madison, WI (2016)           0.560     0.528
## 10 Minneapolis, MN (2016)   Madison, WI (2016)           0.568     0.555
## 11 Newark, NJ (2016)        Lincoln, NE (2016)           0.590     0.571
## 12 Houston, TX (2016)       New Orleans, LA (2016)       0.605     0.542
## 13 New Orleans, LA (2016)   Houston, TX (2016)           0.605     0.667
## 14 San Diego, CA (2016)     Minneapolis, MN (2016)       0.764     0.834
## 15 Las Vegas, NV (2016)     Newark, NJ (2016)            0.809     0.806
# Overall accuracy by location plot
allAccuracy %>%
    group_by(locale) %>%
    summarize_if(is.numeric, mean) %>%
    ggplot(aes(x=fct_reorder(locale, accOverall), y=accOverall)) + 
    geom_point(size=2) + 
    geom_text(aes(y=accOverall+0.02, label=paste0(round(100*accOverall), "%"))) +
    coord_flip() + 
    labs(x="", 
         y="Average Accuracy", 
         title="Average Accuracy Predicting Locale",
         subtitle="Predictions made 1:1 to each other locale (average accuracy reported)",
         caption="Temperature and Dew Point as only predictors\n(50% is baseline coinflip accuracy)"
         ) + 
    ylim(c(0.5, 1))

# Overall accuracy heatmap
allAccuracy %>% 
    ggplot(aes(x=locale, y=other)) + 
    geom_tile(aes(fill=accOverall)) + 
    theme(axis.text.x=element_text(angle=90)) + 
    scale_fill_continuous("Accuracy", high="darkblue", low="white") + 
    labs(title="Accuracy Predicting Locale vs. Locale", 
         caption="Temperature and Dew Point as only predictors\n(50% is baseline coinflip accuracy)",
         x="",
         y=""
         )

One hypothesis is that adding month as a predictor may help distinguish cities in different climates. For example, summer in Detroit may look like spring or fall in Houston or New Orleans:

# Create a container list to hold the output
list_2016_TempF_DewF_month <- vector("list", 0.5*length(names_2016)*(length(names_2016)-1))

n <- 1
for (ctr in 1:(length(names_2016)-1)) {
    for (ctr2 in (ctr+1):length(names_2016)) {
        list_2016_TempF_DewF_month[[n]] <- rfTwoLocales(metarData, 
                                                        loc1=names_2016[ctr], 
                                                        loc2=names_2016[ctr2], 
                                                        vrbls=c("TempF", "DewF", "month"),
                                                        ntree=25
                                                        )
        n <- n + 1
    }
}

Accuracy can then be assessed:

# Create a tibble from the underlying accuracy data
acc_TempF_DewF_month <- map_dfr(list_2016_TempF_DewF_month, .f=helperAccuracyLocale)

# Assess the top 10 accuracy and the bottom 10 accuracy
acc_TempF_DewF_month %>%
    arrange(-accOverall) %>%
    head(10)
## # A tibble: 10 x 5
##    locale1                locale2               accOverall accLocale1 accLocale2
##    <chr>                  <chr>                      <dbl>      <dbl>      <dbl>
##  1 Green Bay, WI (2016)   Las Vegas, NV (2016)       0.961      0.960      0.961
##  2 Las Vegas, NV (2016)   Madison, WI (2016)         0.949      0.952      0.945
##  3 Las Vegas, NV (2016)   Milwaukee, WI (2016)       0.944      0.947      0.941
##  4 New Orleans, LA (2016) Traverse City, MI (2~      0.941      0.946      0.937
##  5 Las Vegas, NV (2016)   Traverse City, MI (2~      0.940      0.940      0.939
##  6 Minneapolis, MN (2016) New Orleans, LA (201~      0.939      0.941      0.938
##  7 Green Bay, WI (2016)   New Orleans, LA (201~      0.938      0.931      0.944
##  8 Grand Rapids, MI (201~ Las Vegas, NV (2016)       0.937      0.939      0.934
##  9 Houston, TX (2016)     Traverse City, MI (2~      0.935      0.938      0.931
## 10 Las Vegas, NV (2016)   Minneapolis, MN (201~      0.935      0.938      0.931
acc_TempF_DewF_month %>%
    arrange(accOverall) %>%
    head(10)
## # A tibble: 10 x 5
##    locale1                locale2               accOverall accLocale1 accLocale2
##    <chr>                  <chr>                      <dbl>      <dbl>      <dbl>
##  1 Chicago, IL (2016)     Milwaukee, WI (2016)       0.578      0.619      0.536
##  2 Detroit, MI (2016)     Grand Rapids, MI (20~      0.584      0.605      0.564
##  3 Grand Rapids, MI (201~ Traverse City, MI (2~      0.598      0.623      0.574
##  4 Chicago, IL (2016)     Madison, WI (2016)         0.600      0.630      0.569
##  5 Green Bay, WI (2016)   Madison, WI (2016)         0.602      0.631      0.573
##  6 Chicago, IL (2016)     Grand Rapids, MI (20~      0.606      0.580      0.631
##  7 Grand Rapids, MI (201~ Madison, WI (2016)         0.608      0.590      0.626
##  8 Madison, WI (2016)     Milwaukee, WI (2016)       0.610      0.641      0.580
##  9 Chicago, IL (2016)     Detroit, MI (2016)         0.612      0.666      0.558
## 10 Chicago, IL (2016)     Indianapolis, IN (20~      0.613      0.637      0.590
allAccuracy_month <- select(acc_TempF_DewF_month, 
                            locale=locale1, 
                            other=locale2, 
                            accOverall, 
                            accLocale=accLocale1
                            ) %>%
    bind_rows(select(acc_TempF_DewF_month, 
                     locale=locale2, 
                     other=locale1, 
                     accOverall, 
                     accLocale=accLocale2
                     )
              )

# Find the best match by locale
allAccuracy_month %>%
    group_by(locale) %>%
    top_n(accOverall, n=1) %>%
    arrange(-accOverall)
## # A tibble: 15 x 4
## # Groups:   locale [15]
##    locale                   other                    accOverall accLocale
##    <chr>                    <chr>                         <dbl>     <dbl>
##  1 Green Bay, WI (2016)     Las Vegas, NV (2016)          0.961     0.960
##  2 Las Vegas, NV (2016)     Green Bay, WI (2016)          0.961     0.961
##  3 Madison, WI (2016)       Las Vegas, NV (2016)          0.949     0.945
##  4 Milwaukee, WI (2016)     Las Vegas, NV (2016)          0.944     0.941
##  5 New Orleans, LA (2016)   Traverse City, MI (2016)      0.941     0.946
##  6 Traverse City, MI (2016) New Orleans, LA (2016)        0.941     0.937
##  7 Minneapolis, MN (2016)   New Orleans, LA (2016)        0.939     0.941
##  8 Grand Rapids, MI (2016)  Las Vegas, NV (2016)          0.937     0.939
##  9 Houston, TX (2016)       Traverse City, MI (2016)      0.935     0.938
## 10 Chicago, IL (2016)       Las Vegas, NV (2016)          0.933     0.933
## 11 Detroit, MI (2016)       Las Vegas, NV (2016)          0.928     0.928
## 12 Indianapolis, IN (2016)  Las Vegas, NV (2016)          0.920     0.927
## 13 San Diego, CA (2016)     Las Vegas, NV (2016)          0.910     0.900
## 14 Lincoln, NE (2016)       Las Vegas, NV (2016)          0.906     0.907
## 15 Newark, NJ (2016)        New Orleans, LA (2016)        0.873     0.858
# Find the worst match by locale
allAccuracy_month %>%
    group_by(locale) %>%
    top_n(-accOverall, n=1) %>%
    arrange(accOverall)
## # A tibble: 15 x 4
## # Groups:   locale [15]
##    locale                   other                   accOverall accLocale
##    <chr>                    <chr>                        <dbl>     <dbl>
##  1 Chicago, IL (2016)       Milwaukee, WI (2016)         0.578     0.619
##  2 Milwaukee, WI (2016)     Chicago, IL (2016)           0.578     0.536
##  3 Detroit, MI (2016)       Grand Rapids, MI (2016)      0.584     0.605
##  4 Grand Rapids, MI (2016)  Detroit, MI (2016)           0.584     0.564
##  5 Traverse City, MI (2016) Grand Rapids, MI (2016)      0.598     0.574
##  6 Madison, WI (2016)       Chicago, IL (2016)           0.600     0.569
##  7 Green Bay, WI (2016)     Madison, WI (2016)           0.602     0.631
##  8 Indianapolis, IN (2016)  Chicago, IL (2016)           0.613     0.590
##  9 Lincoln, NE (2016)       Indianapolis, IN (2016)      0.618     0.559
## 10 Minneapolis, MN (2016)   Madison, WI (2016)           0.626     0.611
## 11 Houston, TX (2016)       New Orleans, LA (2016)       0.640     0.643
## 12 New Orleans, LA (2016)   Houston, TX (2016)           0.640     0.637
## 13 Newark, NJ (2016)        Indianapolis, IN (2016)      0.662     0.678
## 14 San Diego, CA (2016)     Indianapolis, IN (2016)      0.846     0.857
## 15 Las Vegas, NV (2016)     Newark, NJ (2016)            0.872     0.872
# Overall accuracy by location plot
allAccuracy_month %>%
    group_by(locale) %>%
    summarize_if(is.numeric, mean) %>%
    ggplot(aes(x=fct_reorder(locale, accOverall), y=accOverall)) + 
    geom_point(size=2) + 
    geom_text(aes(y=accOverall+0.02, label=paste0(round(100*accOverall), "%"))) +
    coord_flip() + 
    labs(x="", 
         y="Average Accuracy", 
         title="Average Accuracy Predicting Locale",
         subtitle="Predictions made 1:1 to each other locale (average accuracy reported)",
         caption="Temperature and Dew Point as only predictors\n(50% is baseline coinflip accuracy)"
         ) + 
    ylim(c(0.5, 1))

# Overall accuracy heatmap
allAccuracy_month %>% 
    ggplot(aes(x=locale, y=other)) + 
    geom_tile(aes(fill=accOverall)) + 
    theme(axis.text.x=element_text(angle=90)) + 
    scale_fill_continuous("Accuracy", high="darkblue", low="white") + 
    labs(title="Accuracy Predicting Locale vs. Locale", 
         caption="Temperature and Dew Point as only predictors\n(50% is baseline coinflip accuracy)",
         x="",
         y=""
         )

Accuracies are meaningfully higher. Of interest, there is a new group of highly differentiated locales, including New Orleans/Houston vs. Traverse City/Minneapolis. As hypothesized, adding month significantly aids in differentiating wintry cities from hot-humid cities, as summer in a wintry city may otherwise be hard to distinguish from spring/fall in a hot-humid city.

In fact, with the exception of Houston vs. New Orleans, each of Las Vegas, San Diego, Houston, and New Orleans appears to be very well differentiated from all other locales.

Change in accuracy can also be plotted:

# Pull the accuracies for the two different models
x1 <- allAccuracy %>%
    group_by(locale) %>%
    summarize(acc1=mean(accOverall))
x2 <- allAccuracy_month %>%
    group_by(locale) %>%
    summarize(acc2=mean(accOverall))

# Merge accuracy data and plot
x1 %>%
    inner_join(x2, by="locale") %>%
    ggplot(aes(x=fct_reorder(locale, acc2))) + 
    geom_point(aes(y=acc2), size=2) +
    geom_point(aes(y=acc1), size=1) + 
    geom_text(aes(y=acc2+0.02, label=paste0(round(100*acc2), "%"))) +
    geom_text(aes(y=acc1-0.02, label=paste0(round(100*acc1), "%")), size=3) +
    geom_segment(aes(xend=fct_reorder(locale, acc2), y=acc1, yend=acc2-0.005), 
                 arrow=arrow(length=unit(0.3, "cm"), type="closed")
                 ) +
    coord_flip() + 
    labs(x="", 
         y="Average Accuracy", 
         title="Average Accuracy Predicting Locale",
         subtitle="Predictions made 1:1 to each other locale (average accuracy reported)",
         caption="Temperature, Dew Point, month as predictors\n(50% is baseline coinflip accuracy)"
         ) + 
    ylim(c(0.5, 1))

The increase in accuracy is particularly striking for New Orleans and Houston.

Next, the simple model is run to classify locale across the full dataset. The rfTwoLocales() function can be called in a slightly modified manner for this purpose:

# Run random forest for multiple locales
rfMultiLocale <- function(tbl, 
                          vrbls,
                          locs=NULL, 
                          locVar="source", 
                          otherVar="dtime",
                          pred="locale", 
                          seed=NULL, 
                          ntree=100, 
                          mtry=NULL, 
                          testSize=0.3
                          ) {
    
    # FUNCTION ARGUMENTS:
    # tbl: the data frame or tibble
    # vrbls: explanatory variables for modeling
    # locs: the locations to use (NULL means all)
    # locVar: the name of the variable where locs can be found
    # otherVar: other variables to be kept, but not used in modeling
    # pred: predictor variable for modeling
    # seed: the random seed (NULL means no seed)
    # ntree: the number of trees to grow in the random forest
    # mtry: the splitting parameter for the random forest (NULL means use all variables)
    # testSize: the fractional portion of data that should be used as the test dataset
    
    # Create locs if it has not been passed
    if (is.null(locs)) {
        locs <- tbl %>% pull(locVar) %>% unique() %>% sort()
        cat("\nRunning for locations:\n")
        print(locs)
    }
    
    # Pass to rfTwoLocales
    rfOut <- rfTwoLocales(tbl, 
                          loc1=locs, 
                          loc2=c(), 
                          locVar=locVar,
                          otherVar=otherVar, 
                          vrbls=vrbls, 
                          pred=pred, 
                          seed=seed, 
                          ntree=ntree, 
                          mtry=mtry, 
                          testSize=testSize
                          )
    
    # Return the list object
    rfOut
    
}

The function can then be applied to the 2016 data:

# Run random forest for all 2016 locales
rf_all_2016_TDm <- rfMultiLocale(metarData, 
                                 vrbls=c("TempF", "DewF", "month"),
                                 locs=names_2016, 
                                 ntree=50, 
                                 seed=2006201341
                                 )

Summaries can then be created for the accuracy in predicting each locale:

# Create summary of accuracy
all2016Accuracy <- rf_all_2016_TDm$testData %>%
    mutate(locale=factor(locale, levels=levels(predicted))) %>%
    count(locale, predicted, correct) %>%
    group_by(locale) %>%
    mutate(pct=n/sum(n)) %>%
    ungroup()

# Calculate the number of levels
nLevels <- length(levels(factor(all2016Accuracy$locale)))
nullAcc <- 1 / nLevels

# Create plot for overall accuracy
all2016Accuracy %>%
    filter(locale==predicted) %>%
    ggplot(aes(x=fct_reorder(locale, pct))) + 
    geom_point(aes(y=pct), size=2) + 
    geom_text(aes(y=pct+0.04, label=paste0(round(100*pct), "%"))) +
    geom_hline(aes(yintercept=nullAcc), lty=2) +
    coord_flip() + 
    ylim(0, 1) + 
    labs(x="", 
         y="Correctly Predicted", 
         title="Accuracy of Locale Predictions", 
         subtitle="(positive detection rate by locale)", 
         caption=paste0("Temperature, Dew Point, Month as predictors\n(", 
                        round(100*nullAcc), 
                        "% is baseline null accuracy)"
                        )
         )

# Order locales sensibly
locOrder <- all2016Accuracy %>%
    filter(correct) %>%
    arrange(pct) %>%
    pull(locale)

# Create plot for which locales are classified as each other
all2016Accuracy %>%
    mutate(locale=factor(locale, levels=locOrder), 
           predPretty=factor(str_replace(predicted, pattern=", ", replacement="\n"), 
                             levels=str_replace(locOrder, pattern=", ", replacement="\n")
                             )
           ) %>%
    ggplot(aes(y=locale, x=predPretty)) + 
    geom_tile(aes(fill=pct)) + 
    geom_text(aes(label=paste0(round(100*pct), "%"))) + 
    scale_fill_continuous("% Predicted As", low="white", high="green") + 
    scale_x_discrete(position="top") +
    theme(axis.text.x=element_text(angle=90)) + 
    labs(x="",
         y="Actual Locale", 
         title="Predicted Locale vs. Actual Locale", 
         caption="Temperature, Dew Point, Month as predictors"
         )

The model is fairly accurate in predicting Las Vegas and San Diego. It frequently predicts Houston and New Orleans as one of that pair, but often classifies one locale as the other.

Accuracy is much lower for many of the cold weather cities. While there is improvement relative to the null accuracy, there is significant overlap in temperature and dew point by month.

Clouds (minimum levels and ceiling heights) can potentially help further differentiate the cold weather cities on the downwind side of the lake, and can help pull apart Las Vegas (almost always clear) from San Diego (frequent marine layer).

A helper function is built to map cloud heights to buckets, then minimum cloud height and ceiling are added to the model:

# Convert cloud height to buckets and factor
mapCloudHeight <- function(x) {
    
    # Note: -100 encodes "no cloud layer"; heights above 12,000 ft map
    # to "Error", which is not a factor level below and so becomes NA
    factor(case_when(x==-100 ~ "None", 
                     x <= 1000 ~ "Surface", 
                     x <= 3000 ~ "Low", 
                     x <= 6000 ~ "Medium", 
                     x <= 12000 ~ "High", 
                     TRUE ~ "Error"
                     ), 
           levels=c("Surface", "Low", "Medium", "High", "None")
           )
    
}

# Apply cloud height bucketing to metarData
modData <- metarData %>%
    mutate(minHeight=mapCloudHeight(minHeight), ceilingHeight=mapCloudHeight(ceilingHeight))
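
The bucket boundaries can be spot-checked with base R's cut(); this is a sketch that mirrors the thresholds above (for the non-negative heights that occur in the data the two agree, with heights above 12,000 ft becoming NA just as "Error" does):

```r
# Spot-check of the cloud height buckets using base R cut()
heights <- c(-100, 500, 2500, 5000, 10000, 25000)
buckets <- cut(heights,
               breaks=c(-Inf, -100, 1000, 3000, 6000, 12000),
               labels=c("None", "Surface", "Low", "Medium", "High"))
as.character(buckets)
# "None" "Surface" "Low" "Medium" "High" NA
```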

An updated random forest model is then run:

# Run random forest for all 2016 locales
rf_all_2016_TDmc <- rfMultiLocale(modData, 
                                  vrbls=c("TempF", "DewF", "month", "minHeight", "ceilingHeight"),
                                  locs=names_2016, 
                                  ntree=50, 
                                  seed=2006201355
                                  )

The evaluation process is converted to a function:

# Evaluate model predictions
evalPredictions <- function(lst, 
                            plotCaption, 
                            keyVar="locale"
                            ) {

    # FUNCTION ARGUMENTS:
    # lst: the list containing outputs of the modeling
    # plotCaption: description of predictors used
    # keyVar: the variable that represents truth when assessing predictions
    
    # Create summary of accuracy
    all2016Accuracy <- lst$testData %>%
        mutate(locale=factor(get(keyVar), levels=levels(predicted))) %>%
        count(locale, predicted, correct) %>%
        group_by(locale) %>%
        mutate(pct=n/sum(n)) %>%
        ungroup()

    # Calculate the number of levels
    nLevels <- length(levels(factor(all2016Accuracy$locale)))
    nullAcc <- 1 / nLevels

    # Create plot for overall accuracy
    p1 <- all2016Accuracy %>%
        filter(locale==predicted) %>%
        ggplot(aes(x=fct_reorder(locale, pct))) + 
        geom_point(aes(y=pct), size=2) + 
        geom_text(aes(y=pct+0.04, label=paste0(round(100*pct), "%"))) +
        geom_hline(aes(yintercept=nullAcc), lty=2) +
        coord_flip() + 
        ylim(0, 1) + 
        labs(x="", 
             y="Correctly Predicted", 
             title="Accuracy of Locale Predictions", 
             subtitle="(positive detection rate by locale)", 
             caption=paste0(plotCaption, 
                            " as predictors\n(", 
                            round(100*nullAcc), 
                            "% is baseline null accuracy)"
                            )
             )
    print(p1)

    # Order locales sensibly
    locOrder <- all2016Accuracy %>%
        filter(correct) %>%
        arrange(pct) %>%
        pull(locale)

    # Create plot for which locales are classified as each other
    p2 <- all2016Accuracy %>%
        mutate(locale=factor(locale, levels=locOrder), 
               predPretty=factor(str_replace(predicted, pattern=", ", replacement="\n"), 
                                 levels=str_replace(locOrder, pattern=", ", replacement="\n")
                                 )
               ) %>%
        ggplot(aes(y=locale, x=predPretty)) + 
        geom_tile(aes(fill=pct)) + 
        geom_text(aes(label=paste0(round(100*pct), "%"))) + 
        scale_fill_continuous("% Predicted As", low="white", high="green") + 
        scale_x_discrete(position="top") +
        theme(axis.text.x=element_text(angle=90)) + 
        labs(x="",
             y="Actual Locale", 
             title="Predicted Locale vs. Actual Locale", 
             caption=paste0(plotCaption, " as predictors")
             )
    print(p2)
    
    # Return the accuracy object
    all2016Accuracy
    
}

evalPredictions(rf_all_2016_TDmc, plotCaption = "Temp, Dew Point, Month, Cloud/Ceiling Height")

## # A tibble: 225 x 5
##    locale             predicted               correct     n    pct
##    <fct>              <fct>                   <lgl>   <int>  <dbl>
##  1 Chicago, IL (2016) Chicago, IL (2016)      TRUE      706 0.266 
##  2 Chicago, IL (2016) Detroit, MI (2016)      FALSE     197 0.0742
##  3 Chicago, IL (2016) Grand Rapids, MI (2016) FALSE     180 0.0678
##  4 Chicago, IL (2016) Green Bay, WI (2016)    FALSE     119 0.0448
##  5 Chicago, IL (2016) Houston, TX (2016)      FALSE      56 0.0211
##  6 Chicago, IL (2016) Indianapolis, IN (2016) FALSE     263 0.0991
##  7 Chicago, IL (2016) Las Vegas, NV (2016)    FALSE      28 0.0105
##  8 Chicago, IL (2016) Lincoln, NE (2016)      FALSE     119 0.0448
##  9 Chicago, IL (2016) Madison, WI (2016)      FALSE     168 0.0633
## 10 Chicago, IL (2016) Milwaukee, WI (2016)    FALSE     220 0.0829
## # ... with 215 more rows

Accuracy increases, especially for San Diego and several of the cold weather cities.

The model can also be built out to consider wind speed and wind direction. No attempt yet is made to control for over-fitting:

# Run random forest for all 2016 locales
rf_all_2016_TDmcw <- rfMultiLocale(modData, 
                                   vrbls=c("TempF", "DewF", 
                                           "month", 
                                           "minHeight", "ceilingHeight", 
                                           "WindSpeed", "predomDir"
                                           ),
                                   locs=names_2016, 
                                   ntree=50, 
                                   seed=2006201355
                                   )

The evaluation process can again be run:

evalPredictions(rf_all_2016_TDmcw, 
                plotCaption = "Temp, Dew Point, Month, Cloud/Ceiling Height, Wind Speed/Direction"
                )

## # A tibble: 225 x 5
##    locale             predicted               correct     n     pct
##    <fct>              <fct>                   <lgl>   <int>   <dbl>
##  1 Chicago, IL (2016) Chicago, IL (2016)      TRUE      922 0.347  
##  2 Chicago, IL (2016) Detroit, MI (2016)      FALSE     179 0.0674 
##  3 Chicago, IL (2016) Grand Rapids, MI (2016) FALSE     186 0.0701 
##  4 Chicago, IL (2016) Green Bay, WI (2016)    FALSE     139 0.0524 
##  5 Chicago, IL (2016) Houston, TX (2016)      FALSE      28 0.0105 
##  6 Chicago, IL (2016) Indianapolis, IN (2016) FALSE     186 0.0701 
##  7 Chicago, IL (2016) Las Vegas, NV (2016)    FALSE       8 0.00301
##  8 Chicago, IL (2016) Lincoln, NE (2016)      FALSE     101 0.0380 
##  9 Chicago, IL (2016) Madison, WI (2016)      FALSE     183 0.0689 
## 10 Chicago, IL (2016) Milwaukee, WI (2016)    FALSE     285 0.107  
## # ... with 215 more rows

Including wind significantly improves model accuracy for most locales. Even the cold weather cities are now being predicted with 33%-50% accuracy.

Next, a smaller subset of the weather data is explored, with the cities grouped as:

  • Las Vegas
  • San Diego
  • New Orleans / Houston
  • Newark
  • Lincoln
  • Traverse City / Grand Rapids / Detroit
  • Chicago / Milwaukee / Madison / Green Bay / Minneapolis / Indianapolis

Further, a variable will be created for “hr”, the Zulu hour of the observation:

locale2016Mapper <- c('Cold-MI', 'Newark', 'Cold', 'Cold-MI', 'Humid', 
                      'Cold', 'Desert', 'Lincoln', 'Cold', 'Cold', 
                      'Cold', 'Humid', 'Cold', 'Marine', 'Cold-MI')
names(locale2016Mapper) <- c('Detroit, MI (2016)', 'Newark, NJ (2016)', 
                             'Green Bay, WI (2016)', 'Grand Rapids, MI (2016)', 
                             'Houston, TX (2016)', 'Indianapolis, IN (2016)', 
                             'Las Vegas, NV (2016)', 'Lincoln, NE (2016)', 
                             'Milwaukee, WI (2016)', 'Madison, WI (2016)', 
                             'Minneapolis, MN (2016)', 'New Orleans, LA (2016)', 
                             'Chicago, IL (2016)', 'San Diego, CA (2016)', 
                             'Traverse City, MI (2016)')

tibble::tibble(locale=names(locale2016Mapper), mapping=locale2016Mapper) %>%
    arrange(mapping, locale)
## # A tibble: 15 x 2
##    locale                   mapping
##    <chr>                    <chr>  
##  1 Chicago, IL (2016)       Cold   
##  2 Green Bay, WI (2016)     Cold   
##  3 Indianapolis, IN (2016)  Cold   
##  4 Madison, WI (2016)       Cold   
##  5 Milwaukee, WI (2016)     Cold   
##  6 Minneapolis, MN (2016)   Cold   
##  7 Detroit, MI (2016)       Cold-MI
##  8 Grand Rapids, MI (2016)  Cold-MI
##  9 Traverse City, MI (2016) Cold-MI
## 10 Las Vegas, NV (2016)     Desert 
## 11 Houston, TX (2016)       Humid  
## 12 New Orleans, LA (2016)   Humid  
## 13 Lincoln, NE (2016)       Lincoln
## 14 San Diego, CA (2016)     Marine 
## 15 Newark, NJ (2016)        Newark
modData <- modData %>%
    mutate(hr=lubridate::hour(dtime), 
           locType=locale2016Mapper[locale]
           )
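
The mutate() above combines two small idioms: named-character-vector lookup (names act as keys) and extraction of the Zulu hour. A minimal self-contained sketch, using a hypothetical two-entry mapper and base R's POSIXlt in place of lubridate::hour() (equivalent for UTC date-times):

```r
# Named-vector lookup: subsetting by name returns the mapped values;
# names not present in the mapper yield NA
mapper <- c("Houston, TX (2016)"="Humid", "Las Vegas, NV (2016)"="Desert")
locType <- unname(mapper[c("Las Vegas, NV (2016)", "Houston, TX (2016)")])

# Zulu (UTC) hour of day as an integer 0-23
dt <- as.POSIXct(c("2016-07-01 05:00:00", "2016-07-01 18:00:00"), tz="UTC")
hr <- as.POSIXlt(dt)$hour
```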

modData %>%
    count(locType, year) %>%
    pivot_wider(locType, names_from="year", values_from="n")
## # A tibble: 8 x 4
##   locType `2016` `2015` `2017`
##   <chr>    <int>  <int>  <int>
## 1 Cold     52521     NA     NA
## 2 Cold-MI  26300     NA     NA
## 3 Desert    8770     NA     NA
## 4 Humid    17535     NA     NA
## 5 Lincoln   8765     NA     NA
## 6 Marine    8762     NA     NA
## 7 Newark    8773     NA     NA
## 8 <NA>        NA  34919  34871

The data are then filtered so that there are an equal number of observations from each locale type. The model can later be used to predict the remaining observations:

# Set a seed for reproducibility
set.seed(2006211352)

# Find the smallest locale type
nSmall <- modData %>%
    filter(!is.na(locType)) %>%
    count(locType) %>%
    pull(n) %>%
    min()

# Create the relevant data subset
subData <- modData %>%
    filter(!is.na(locType)) %>%
    group_by(locType) %>%
    sample_n(size=nSmall, replace=FALSE) %>%
    ungroup()

# Summarize the data subset
subData %>% 
    count(locale) %>% 
    arrange(-n)
## # A tibble: 15 x 2
##    locale                       n
##    <chr>                    <int>
##  1 Las Vegas, NV (2016)      8762
##  2 Lincoln, NE (2016)        8762
##  3 Newark, NJ (2016)         8762
##  4 San Diego, CA (2016)      8762
##  5 New Orleans, LA (2016)    4401
##  6 Houston, TX (2016)        4361
##  7 Grand Rapids, MI (2016)   2953
##  8 Traverse City, MI (2016)  2951
##  9 Detroit, MI (2016)        2858
## 10 Milwaukee, WI (2016)      1503
## 11 Chicago, IL (2016)        1482
## 12 Minneapolis, MN (2016)    1473
## 13 Madison, WI (2016)        1452
## 14 Green Bay, WI (2016)      1439
## 15 Indianapolis, IN (2016)   1413

The previous model can then be run on the data subset:

# Run random forest for all 2016 locale types
rf_types_2016_TDmcw <- rfMultiLocale(subData, 
                                     vrbls=c("TempF", "DewF", 
                                             "month", 
                                             "minHeight", "ceilingHeight", 
                                             "WindSpeed", "predomDir"
                                             ),
                                     locs=NULL, 
                                     locVar="locType",
                                     pred="locType",
                                     ntree=50, 
                                     seed=2006211401
                                     )
## 
## Running for locations:
## [1] "Cold"    "Cold-MI" "Desert"  "Humid"   "Lincoln" "Marine"  "Newark"

The evaluation process can again be run:

evalPredictions(rf_types_2016_TDmcw, 
                plotCaption = "Temp, Dew Point, Month, Cloud/Ceiling Height, Wind Speed/Direction", 
                keyVar="locType"
                )

## # A tibble: 49 x 5
##    locale  predicted correct     n    pct
##    <fct>   <fct>     <lgl>   <int>  <dbl>
##  1 Cold    Cold      TRUE     1127 0.429 
##  2 Cold    Cold-MI   FALSE     633 0.241 
##  3 Cold    Desert    FALSE      47 0.0179
##  4 Cold    Humid     FALSE      86 0.0327
##  5 Cold    Lincoln   FALSE     350 0.133 
##  6 Cold    Marine    FALSE      63 0.0240
##  7 Cold    Newark    FALSE     322 0.123 
##  8 Cold-MI Cold      FALSE     546 0.208 
##  9 Cold-MI Cold-MI   TRUE     1327 0.506 
## 10 Cold-MI Desert    FALSE      39 0.0149
## # ... with 39 more rows

Predictive ability is strongest for the well-differentiated types, and weakest for the more poorly differentiated types. The model is about 50% accurate in classifying cold climates; about 25% of the time they are classified as the other type of cold climate, and about 25% of the time they are classified as something else (typically Lincoln or Newark).

The modeling is run again using a much smaller training dataset (testSize set to 0.9) to see the implication of a much smaller data volume:

# Run random forest for all 2016 locale types
rf_types_2016_small_TDmcw <- rfMultiLocale(subData, 
                                           vrbls=c("TempF", "DewF", 
                                                   "month", 
                                                   "minHeight", "ceilingHeight", 
                                                   "WindSpeed", "predomDir"
                                                   ),
                                           locs=NULL, 
                                           locVar="locType",
                                           pred="locType",
                                           ntree=50, 
                                           seed=2006211419,
                                           testSize=0.9
                                           )
## 
## Running for locations:
## [1] "Cold"    "Cold-MI" "Desert"  "Humid"   "Lincoln" "Marine"  "Newark"

The evaluation process can again be run:

evalPredictions(rf_types_2016_small_TDmcw, 
                plotCaption = "Temp, Dew Point, Month, Cloud/Ceiling Height, Wind Speed/Direction", 
                keyVar="locType"
                )

## # A tibble: 49 x 5
##    locale  predicted correct     n    pct
##    <fct>   <fct>     <lgl>   <int>  <dbl>
##  1 Cold    Cold      TRUE     2510 0.318 
##  2 Cold    Cold-MI   FALSE    2123 0.269 
##  3 Cold    Desert    FALSE     150 0.0190
##  4 Cold    Humid     FALSE     322 0.0407
##  5 Cold    Lincoln   FALSE    1351 0.171 
##  6 Cold    Marine    FALSE     239 0.0302
##  7 Cold    Newark    FALSE    1210 0.153 
##  8 Cold-MI Cold      FALSE    1678 0.216 
##  9 Cold-MI Cold-MI   TRUE     3354 0.431 
## 10 Cold-MI Desert    FALSE     175 0.0225
## # ... with 39 more rows

As expected, predictions are less accurate with a smaller training dataset. Overall accuracy falls from ~70% to ~60% with the much smaller training data volume. This suggests the model is continuing to learn from the data, and may benefit from an expanded sample.

Next, the model is adapted to include the hour (as an integer, which may need to be rethought, though even without conversion to a factor it will generally capture whether it is daytime or nighttime) and the sea-level pressure variable:

# Run random forest for all 2016 locale types - add hour and modSLP, limit mtry to 4
rf_types_2016_TDmcwha <- rfMultiLocale(subData, 
                                       vrbls=c("TempF", "DewF", 
                                               "month", "hr",
                                               "minHeight", "ceilingHeight", 
                                               "WindSpeed", "predomDir", 
                                               "modSLP"
                                               ),
                                       locs=NULL, 
                                       locVar="locType",
                                       pred="locType",
                                       ntree=50, 
                                       seed=2006211423, 
                                       mtry=4
                                       )
## 
## Running for locations:
## [1] "Cold"    "Cold-MI" "Desert"  "Humid"   "Lincoln" "Marine"  "Newark"

The evaluation process can again be run:

evalPredictions(rf_types_2016_TDmcwha, 
                plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind, SLP", 
                keyVar="locType"
                )
## Warning: Removed 1 rows containing missing values (geom_text).

## # A tibble: 49 x 5
##    locale  predicted correct     n     pct
##    <fct>   <fct>     <lgl>   <int>   <dbl>
##  1 Cold    Cold      TRUE     1265 0.480  
##  2 Cold    Cold-MI   FALSE     615 0.233  
##  3 Cold    Desert    FALSE      18 0.00683
##  4 Cold    Humid     FALSE      59 0.0224 
##  5 Cold    Lincoln   FALSE     393 0.149  
##  6 Cold    Marine    FALSE      53 0.0201 
##  7 Cold    Newark    FALSE     233 0.0884 
##  8 Cold-MI Cold      FALSE     496 0.194  
##  9 Cold-MI Cold-MI   TRUE     1521 0.596  
## 10 Cold-MI Desert    FALSE      22 0.00861
## # ... with 39 more rows
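
On the integer-hour point: even without converting hr to a factor, a tree can approximate day vs. night with one or two splits, because local daytime maps to a (possibly wrapped) contiguous band of Zulu hours. A sketch for a hypothetical UTC-6 station:

```r
# For a UTC-6 station, local hour = (Zulu hour - 6) mod 24;
# local daytime (06-18) is then a band of Zulu hours a split can isolate
hr <- 0:23
localHr <- (hr - 6) %% 24
isLocalDay <- localHr >= 6 & localHr <= 18
```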

Accuracy increases to 75%. In particular, the model is better able to distinguish Lincoln and Newark from the remaining locations, which has a follow-on effect of improving cold weather classifications.

Exploration is then run on distinguishing two locales from each other, to see the implications of parameters like data volume (controlled through testSize) and the number of trees.

Comparisons will be run as follows:

  • Desert vs. Humid (should be easiest to pull apart)
  • Cold vs. Cold-MI (should be hardest to pull apart)

ntrees <- c(10, 25, 50, 100, 250, 500)
testSizes <- c(0.9, 0.7, 0.3)

desertHumid <- vector("list", length(ntrees) * length(testSizes))
n <- 1

for (ntree in ntrees) {
    for (testSize in testSizes) {

        desertHumid[[n]] <- rfTwoLocales(subData, 
                                         loc1="Desert", 
                                         loc2="Humid", 
                                         locVar="locType", 
                                         vrbls=c("TempF", "DewF", "month", "hr", 
                                                 "minHeight", "ceilingHeight", 
                                                 "WindSpeed", "predomDir", "modSLP"
                                                 ),
                                         pred="locType", 
                                         seed=2006211432,
                                         ntree=ntree, 
                                         mtry=4,
                                         testSize=testSize
                                         )
        desertHumid[[n]]$ntree <- ntree
        desertHumid[[n]]$testSize <- testSize
        
        n <- n+1
        
    }
}

An assessment of accuracy can then be performed:

df <- sapply(desertHumid, 
             FUN=function(x) { c(x[["errorRate"]]["OOB"], ntree=x$ntree, testSize=x$testSize) }
             ) %>%
    t() %>%
    tibble::as_tibble()

df %>%
    mutate(accuracy=1-OOB, trainSize=1-testSize) %>%
    ggplot(aes(x=ntree, y=trainSize)) +
    geom_text(aes(label=paste0(round(100*accuracy), "%"))) + 
    scale_x_log10()

As expected, accuracy is very high for distinguishing Desert and Humid climates, even with small data volumes (training size 10% of the data) and a small number of trees.

The approach is then run for the cold weather cities:

ntrees <- c(10, 25, 50, 100, 250, 500)
testSizes <- c(0.9, 0.7, 0.3)

coldColdMI <- vector("list", length(ntrees) * length(testSizes))
n <- 1

for (ntree in ntrees) {
    for (testSize in testSizes) {

        coldColdMI[[n]] <- rfTwoLocales(subData, 
                                        loc1="Cold", 
                                        loc2="Cold-MI", 
                                        locVar="locType", 
                                        vrbls=c("TempF", "DewF", "month", "hr", 
                                                "minHeight", "ceilingHeight", 
                                                "WindSpeed", "predomDir", "modSLP"
                                                ),
                                        pred="locType", 
                                        seed=2006211501,
                                        ntree=ntree, 
                                        mtry=4,
                                        testSize=testSize
                                        )
        coldColdMI[[n]]$ntree <- ntree
        coldColdMI[[n]]$testSize <- testSize
        
        n <- n+1
        
    }
}

An assessment of accuracy can then be performed:

dfColdColdMI <- sapply(coldColdMI, 
                       FUN=function(x) { c(x[["errorRate"]]["OOB"], ntree=x$ntree, testSize=x$testSize) }
                       ) %>%
    t() %>%
    tibble::as_tibble()

dfColdColdMI %>%
    mutate(accuracy=1-OOB, trainSize=1-testSize) %>%
    ggplot(aes(x=ntree, y=trainSize)) +
    geom_text(aes(label=paste0(round(100*accuracy), "%"))) + 
    scale_x_log10()

Greater data volumes and larger forests (more trees) are markedly more helpful for distinguishing the cold weather city types from each other. The smallest forest has 56% accuracy (against a 50% null baseline) while the largest has 68% accuracy. This suggests some potential upside to continuing the cold weather modeling process with more cities added for greater data volume.

Data from years other than 2016 can be fed into the model, with predictions made to see whether the model is learning general differences in climate or features specific to 2016.

Data that were not included in the modeling can be predicted, with the results assessed:

  • 2016 data that was not chosen for modeling - serves as an additional test frame
  • 2015 and 2017 data - can show the extent to which the model is learning general climate differences as opposed to specific 2016 weather features

Example code includes:

nonUsedData <- modData %>%
    anti_join(select(subData, source, dtime))
## Joining, by = c("source", "dtime")
nonUsedData %>%
    mutate(localeName=str_replace(locale, pattern=" .{1}\\d{4}.*", replacement="")) %>%
    count(localeName, year) %>%
    pivot_wider(localeName, names_from="year", values_from="n")
## # A tibble: 15 x 4
##    localeName        `2015` `2016` `2017`
##    <chr>              <int>  <int>  <int>
##  1 Chicago, IL         8728   7285   8740
##  2 Detroit, MI           NA   5912     NA
##  3 Grand Rapids, MI      NA   5811     NA
##  4 Green Bay, WI         NA   7316     NA
##  5 Houston, TX           NA   4407     NA
##  6 Indianapolis, IN      NA   7307     NA
##  7 Las Vegas, NV       8727      8   8664
##  8 Lincoln, NE           NA      3     NA
##  9 Madison, WI           NA   7298     NA
## 10 Milwaukee, WI         NA   7257     NA
## 11 Minneapolis, MN       NA   7296     NA
## 12 New Orleans, LA     8727   4366   8723
## 13 Newark, NJ            NA     11     NA
## 14 San Diego, CA       8737     NA   8744
## 15 Traverse City, MI     NA   5815     NA

A function is written to make predictions and plot the results:

# Function to predict model on to df, then plot the outcomes
helperPredictPlot <- function(model, 
                              df, 
                              vrbls=NULL, 
                              origVar="locale", 
                              limitObs=500, 
                              predOrder=NULL, 
                              locMapper=NULL
                              ) {
    
    # FUNCTION ARGUMENTS:
    # model: a trained model
    # df: a data frame or tibble for the model to be predicted on to
    # vrbls: the variables needed by model (records with NA will be deleted)
    # origVar: the original (truth) variable from df
    # limitObs: a locale must have more than this many observations to be plotted
    # predOrder: ordering of factor variable "predicted" (NULL means leave as-is)
    # locMapper: mapping file for locations (will order locale to match predicted if not NULL)
    
    # Get variable names if NULL, assuming random forest
    if (is.null(vrbls)) {
        vrbls <- model$importance %>% rownames()
    }
    
    # Filter df to have only complete cases for vrbls
    dfPred <- df %>%
        filter_at(vars(all_of(vrbls)), all_vars(!is.na(.)))
    
    # Make predictions
    dfPred <- dfPred %>%
        mutate(predicted=predict(model, newdata=select_at(dfPred, vars(all_of(vrbls)))))
    
    # Summarize predictions
    dfAccuracy <- dfPred %>%
        group_by_at(vars(all_of(c(origVar, "predicted")))) %>%
        summarize(n=n()) %>%
        mutate(pct=n/sum(n)) %>%
        ungroup()
    
    # Filter dfAccuracy to only origVar with count of at least limitObs
    hasEnough <- dfAccuracy %>%
        group_by_at(vars(all_of(origVar))) %>%
        summarize(n=sum(n)) %>%
        filter(n > limitObs) %>%
        pull(origVar)
    
    # Create the proper order for the factor variable 'predicted' if requested
    if (!is.null(predOrder)) {
        dfAccuracy <- dfAccuracy %>%
            mutate(predicted=factor(predicted, levels=predOrder))
    }
    
    # Create the base locale mapping
    locales <- dfAccuracy %>%
        pull(origVar) %>%
        unique()
    
    # Modify the locale mapping if a mapper is provided
    if (!is.null(locMapper) & !is.null(predOrder)) {
        localeStrip <- str_replace(locales, pattern=" .\\d{4}.", replacement="")
        locales <- locales[order(match(locMapper[localeStrip], predOrder))]
    }
    
    # Create plot for which locales are classified as each other
    p1 <- dfAccuracy %>%
        filter_at(vars(all_of(origVar)), all_vars(. %in% hasEnough)) %>%
        mutate(locale=factor(get(origVar), levels=locales)) %>%
        ggplot(aes(y=locale, x=predicted)) +
        geom_tile(aes(fill=pct)) +
        geom_text(aes(label=paste0(round(100*pct), "%"))) +
        scale_fill_continuous("% Predicted As", low="white", high="green") +
        scale_x_discrete(position="top") +
        theme(axis.text.x=element_text(angle=90)) +
        labs(x="",
             y="Actual Locale",
             title="Predicted Locale vs. Actual Locale"
             )
    print(p1)

    # Return the predictions summary
    dfAccuracy
    
}

The function can then be applied to the 2016 data:

# Modify the file for mapping locale to type
localeMapper <- locale2016Mapper
names(localeMapper) <- str_replace(names(localeMapper), pattern=" .\\d{4}.", replacement="")

# Predictions on 2016 data
helperPredictPlot(rf_types_2016_TDmcwha$rfModel, 
                  df=filter(nonUsedData, year==2016), 
                  predOrder=c("Cold", "Cold-MI", "Lincoln", "Newark", "Marine", "Humid", "Desert"), 
                  locMapper=localeMapper
                  )

## # A tibble: 83 x 4
##    locale             predicted     n     pct
##    <chr>              <fct>     <int>   <dbl>
##  1 Chicago, IL (2016) Cold       3608 0.496  
##  2 Chicago, IL (2016) Cold-MI    1792 0.246  
##  3 Chicago, IL (2016) Desert       60 0.00825
##  4 Chicago, IL (2016) Humid       165 0.0227 
##  5 Chicago, IL (2016) Lincoln     766 0.105  
##  6 Chicago, IL (2016) Marine      157 0.0216 
##  7 Chicago, IL (2016) Newark      728 0.100  
##  8 Detroit, MI (2016) Cold       1111 0.189  
##  9 Detroit, MI (2016) Cold-MI    3290 0.560  
## 10 Detroit, MI (2016) Desert       66 0.0112 
## # ... with 73 more rows

The predictions are in line with the test-dataset conclusions, as would be expected given that the “unused” 2016 data are randomly sampled and should be no different from the “test” 2016 dataset.

The function can then be applied to the 2015 and 2017 data:

# Predictions on 2015/2017 data
helperPredictPlot(rf_types_2016_TDmcwha$rfModel, 
                  df=filter(nonUsedData, year!=2016), 
                  predOrder=c("Cold", "Cold-MI", "Lincoln", "Newark", "Marine", "Humid", "Desert"), 
                  locMapper=localeMapper
                  )

## # A tibble: 56 x 4
##    locale             predicted     n    pct
##    <chr>              <fct>     <int>  <dbl>
##  1 Chicago, IL (2015) Cold       3043 0.349 
##  2 Chicago, IL (2015) Cold-MI    2269 0.260 
##  3 Chicago, IL (2015) Desert      137 0.0157
##  4 Chicago, IL (2015) Humid       257 0.0295
##  5 Chicago, IL (2015) Lincoln    1373 0.157 
##  6 Chicago, IL (2015) Marine      223 0.0256
##  7 Chicago, IL (2015) Newark     1418 0.163 
##  8 Chicago, IL (2017) Cold       2537 0.291 
##  9 Chicago, IL (2017) Cold-MI    2299 0.263 
## 10 Chicago, IL (2017) Desert      227 0.0260
## # ... with 46 more rows

Model performance on 2015 and 2017 data is not as strong, with roughly a 10%-20% loss of accuracy. Predictions are still much better than null accuracy, and the model (mostly) continues to separate the three well-differentiated types (Desert, Humid, Marine).

However, the model struggles to classify Chicago, placing it roughly equally as Cold (correct), Cold-MI, Lincoln, and Newark. This suggests cold-weather cities may be being predicted based on anomalies specific to 2016 (e.g., where the cold snaps hit). More data may help the model generalize better, specifically by separating the recurring, generalizable climate features of Chicago from weather anomalies that happened to hit Chicago in 2016.

Suppose that models are run on all 2015-2017 data for Chicago, Las Vegas, New Orleans, and San Diego:

# Create the subset for Chicago, Las Vegas, New Orleans, San Diego (should have 2015, 2016, 2017)
sub_2015_2017_data <- modData %>%
    filter(str_sub(source, 1, 4) %in% c("kord", "klas", "kmsy", "ksan")) %>%
    mutate(city=str_replace(locale, pattern=" .\\d{4}.", replacement=""))

# Check that proper locales are included
sub_2015_2017_data %>% 
    count(city, locale)
## # A tibble: 12 x 3
##    city            locale                     n
##    <chr>           <chr>                  <int>
##  1 Chicago, IL     Chicago, IL (2015)      8728
##  2 Chicago, IL     Chicago, IL (2016)      8767
##  3 Chicago, IL     Chicago, IL (2017)      8740
##  4 Las Vegas, NV   Las Vegas, NV (2015)    8727
##  5 Las Vegas, NV   Las Vegas, NV (2016)    8770
##  6 Las Vegas, NV   Las Vegas, NV (2017)    8664
##  7 New Orleans, LA New Orleans, LA (2015)  8727
##  8 New Orleans, LA New Orleans, LA (2016)  8767
##  9 New Orleans, LA New Orleans, LA (2017)  8723
## 10 San Diego, CA   San Diego, CA (2015)    8737
## 11 San Diego, CA   San Diego, CA (2016)    8762
## 12 San Diego, CA   San Diego, CA (2017)    8744
# Run random forest for 2015-2017 data
rf_types_2015_2017_TDmcwha <- rfMultiLocale(sub_2015_2017_data, 
                                            vrbls=c("TempF", "DewF", 
                                                    "month", "hr",
                                                    "minHeight", "ceilingHeight", 
                                                    "WindSpeed", "predomDir", 
                                                    "modSLP"
                                                    ),
                                            locs=NULL, 
                                            locVar="city",
                                            pred="city",
                                            ntree=25, 
                                            seed=2006221334, 
                                            mtry=4
                                            )
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"
evalPredictions(rf_types_2015_2017_TDmcwha, 
                plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind, SLP", 
                keyVar="city"
                )

## # A tibble: 16 x 5
##    locale          predicted       correct     n     pct
##    <fct>           <fct>           <lgl>   <int>   <dbl>
##  1 Chicago, IL     Chicago, IL     TRUE     7473 0.941  
##  2 Chicago, IL     Las Vegas, NV   FALSE     110 0.0139 
##  3 Chicago, IL     New Orleans, LA FALSE     162 0.0204 
##  4 Chicago, IL     San Diego, CA   FALSE     196 0.0247 
##  5 Las Vegas, NV   Chicago, IL     FALSE     122 0.0158 
##  6 Las Vegas, NV   Las Vegas, NV   TRUE     7434 0.960  
##  7 Las Vegas, NV   New Orleans, LA FALSE      49 0.00633
##  8 Las Vegas, NV   San Diego, CA   FALSE     141 0.0182 
##  9 New Orleans, LA Chicago, IL     FALSE     158 0.0201 
## 10 New Orleans, LA Las Vegas, NV   FALSE      70 0.00891
## 11 New Orleans, LA New Orleans, LA TRUE     7356 0.937  
## 12 New Orleans, LA San Diego, CA   FALSE     268 0.0341 
## 13 San Diego, CA   Chicago, IL     FALSE     132 0.0168 
## 14 San Diego, CA   Las Vegas, NV   FALSE     188 0.0239 
## 15 San Diego, CA   New Orleans, LA FALSE     152 0.0194 
## 16 San Diego, CA   San Diego, CA   TRUE     7380 0.940

Even with a very small forest (25 trees), the model almost always separates Las Vegas, Chicago, San Diego, and New Orleans correctly. While the climates of these cities are very different, it is striking that the model has so few misclassifications.
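Overall accuracy can be recovered from a confusion summary like the one above. A base-R sketch using the Chicago and Las Vegas counts from the table (the data-frame layout mirrors, but is not, the actual evalPredictions output):

```r
# Confusion summary rows (correct flag and count), using counts from the table above
confusion <- data.frame(
    correct = c(TRUE, FALSE, TRUE, FALSE),
    n       = c(7473, 110 + 162 + 196,   # Chicago: correct vs. all misclassifications
                7434, 122 + 49 + 141)    # Las Vegas: correct vs. all misclassifications
)
# Overall accuracy = correctly classified observations / all observations
overallAccuracy <- sum(confusion$n[confusion$correct]) / sum(confusion$n)
round(overallAccuracy, 3)  # ~0.95 for these two cities
```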

How do other cities map against these classifications?

# Predictions on 2015/2017 data
helperPredictPlot(rf_types_2015_2017_TDmcwha$rfModel, 
                  df=filter(modData, !(str_sub(source, 1, 4) %in% c("kord", "klas", "kmsy", "ksan"))), 
                  predOrder=c("Chicago, IL", "San Diego, CA", "New Orleans, LA", "Las Vegas, NV")
                  )

## # A tibble: 44 x 4
##    locale                  predicted           n     pct
##    <chr>                   <fct>           <int>   <dbl>
##  1 Detroit, MI (2016)      Chicago, IL      7614 0.873  
##  2 Detroit, MI (2016)      Las Vegas, NV     262 0.0300 
##  3 Detroit, MI (2016)      New Orleans, LA   553 0.0634 
##  4 Detroit, MI (2016)      San Diego, CA     292 0.0335 
##  5 Grand Rapids, MI (2016) Chicago, IL      7927 0.918  
##  6 Grand Rapids, MI (2016) Las Vegas, NV     133 0.0154 
##  7 Grand Rapids, MI (2016) New Orleans, LA   283 0.0328 
##  8 Grand Rapids, MI (2016) San Diego, CA     288 0.0334 
##  9 Green Bay, WI (2016)    Chicago, IL      7958 0.915  
## 10 Green Bay, WI (2016)    Las Vegas, NV      85 0.00978
## # ... with 34 more rows

Several patterns stand out:

  • Houston is most similar to New Orleans, as expected
  • Milwaukee, Green Bay, and Grand Rapids are all classified 90%+ as Chicago
  • Madison, Traverse City, Minneapolis, and Detroit are all classified 85%+ as Chicago
  • Lincoln, Indianapolis, and Newark are all “closest” to Chicago, but all show meaningful prediction volumes to New Orleans (these are the three “least cold” of the cold-weather cities in the modeling)
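The tile plots drawn by helperPredictPlot are essentially wide views of the long (locale, predicted, pct) summary it returns. A hedged base-R sketch of that reshape, using the Detroit share from the table above plus purely illustrative Houston shares:

```r
# Long summary of prediction shares (Detroit values from the table above;
# Houston values are purely illustrative)
longSummary <- data.frame(
    locale    = rep(c("Detroit, MI (2016)", "Houston, TX (2016)"), each = 2),
    predicted = rep(c("Chicago, IL", "New Orleans, LA"), times = 2),
    pct       = c(0.873, 0.063, 0.050, 0.850)
)
# Pivot to a locale-by-prediction matrix, as drawn by geom_tile()
voteMatrix <- xtabs(pct ~ locale + predicted, data = longSummary)
voteMatrix
```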

The prediction process is re-run excluding the sea-level pressure (modSLP) data:

# Run random forest for 2015-2017 data (exclude modSLP)
rf_types_2015_2017_TDmcwh <- rfMultiLocale(sub_2015_2017_data, 
                                           vrbls=c("TempF", "DewF", 
                                                   "month", "hr",
                                                   "minHeight", "ceilingHeight", 
                                                   "WindSpeed", "predomDir"
                                                   ),
                                            locs=NULL, 
                                            locVar="city",
                                            pred="city",
                                            ntree=25, 
                                            seed=2006221334, 
                                            mtry=4
                                            )
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"
evalPredictions(rf_types_2015_2017_TDmcwh, 
                plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind", 
                keyVar="city"
                )

## # A tibble: 16 x 5
##    locale          predicted       correct     n    pct
##    <fct>           <fct>           <lgl>   <int>  <dbl>
##  1 Chicago, IL     Chicago, IL     TRUE     7296 0.919 
##  2 Chicago, IL     Las Vegas, NV   FALSE     140 0.0176
##  3 Chicago, IL     New Orleans, LA FALSE     263 0.0331
##  4 Chicago, IL     San Diego, CA   FALSE     242 0.0305
##  5 Las Vegas, NV   Chicago, IL     FALSE     186 0.0240
##  6 Las Vegas, NV   Las Vegas, NV   TRUE     7317 0.945 
##  7 Las Vegas, NV   New Orleans, LA FALSE      95 0.0123
##  8 Las Vegas, NV   San Diego, CA   FALSE     148 0.0191
##  9 New Orleans, LA Chicago, IL     FALSE     248 0.0316
## 10 New Orleans, LA Las Vegas, NV   FALSE     123 0.0157
## 11 New Orleans, LA New Orleans, LA TRUE     7137 0.909 
## 12 New Orleans, LA San Diego, CA   FALSE     344 0.0438
## 13 San Diego, CA   Chicago, IL     FALSE     175 0.0223
## 14 San Diego, CA   Las Vegas, NV   FALSE     231 0.0294
## 15 San Diego, CA   New Orleans, LA FALSE     202 0.0257
## 16 San Diego, CA   San Diego, CA   TRUE     7244 0.923

Performance drops, since sea-level pressure readings are meaningfully different between the high-altitude (Las Vegas) and sea-level (New Orleans and San Diego) locations.

Variable importances are plotted:

helperPlotVarImp <- function(model, titleAdd="", mapper=varMapper) {
    
    p1 <- model %>%
        caret::varImp() %>%
        rownames_to_column("predictor") %>%
        ggplot(aes(x=fct_reorder(paste0(predictor, "\n", mapper[predictor]), Overall), y=Overall)) + 
        geom_col(fill="lightblue") + 
        labs(x="", y="", title=paste0("Variable Importance", titleAdd)) + 
        coord_flip()
    print(p1)
    
}

helperPlotVarImp(rf_types_2015_2017_TDmcwha$rfModel)

helperPlotVarImp(rf_types_2015_2017_TDmcwh$rfModel, titleAdd=" (Excludes SLP)")

Dew point and temperature by month continue to be strong factors for separating the four cities in this analysis. SLP, minimum cloud height, and prevailing wind direction are also meaningful.

Prediction certainties can also be assessed:

assessPredictionCertainty <- function(lst, 
                                      plotCaption, 
                                      keyVar="locale",
                                      testData=lst[["testData"]],
                                      listModel="rfModel", 
                                      pkVars=c("dtime"), 
                                      thresh=0.8, 
                                      showHists=FALSE, 
                                      showAcc=FALSE
                                      ) {
    
    # FUNCTION ARGUMENTS:
    # lst: the list containing outputs of the modeling
    # plotCaption: description of predictors used
    # keyVar: the variable that represents truth when assessing predictions
    # testData: the test dataset (default is the 'testData' element of lst)
    # listModel: the named element of lst containing the random forest model
    # pkVars: the variable(s) of listData that, with keyVar, function as a primary key (unique identifier)
    # thresh: threshold for voting confidence
    # showHists: boolean, whether to create histograms for confidence
    # showAcc: boolean, whether to show accuracy by percent of votes (only makes sense when testData and listModel have the same levels for keyVar)

    # Get variable names for the random forest
    vrbls <- lst[[listModel]]$importance %>% rownames()
    
    # Filter testData to have only complete cases for vrbls
    dfPred <- testData %>%
        filter_at(vars(all_of(vrbls)), all_vars(!is.na(.))) %>%
        select_at(vars(all_of(c(keyVar, pkVars, vrbls))))
    
    # Create voting summary for every element in testData
    predSummary <- predict(lst[[listModel]], newdata=dfPred, type="prob") %>%
        as.data.frame() %>%
        tibble::as_tibble() %>%
        bind_cols(select_at(dfPred, vars(all_of(c(keyVar, pkVars))))) %>%
        mutate(finalPrediction=predict(lst[[listModel]], newdata=dfPred)) %>%
        pivot_longer(cols=-c(keyVar, pkVars, "finalPrediction"), 
                     names_to="prediction", 
                     values_to="pctVotes"
                     )
    
    # Create summary of maximum prediction
    allPredictions <- predSummary %>%
        group_by_at(vars(all_of(c(keyVar, pkVars)))) %>%
        mutate(maxPct=max(pctVotes), topProb=(pctVotes==maxPct)) %>%
        ungroup()
    
    
    # Create and show histograms if requested
    if (showHists) {
        
        # Create plot for confidence level by true locale
        p1 <- allPredictions %>%
            filter(finalPrediction==prediction) %>%
            ggplot(aes(x=pctVotes, y=..count../sum(..count..))) + 
            geom_histogram() + 
            labs(title="Percent of Votes Received When Truth Is:", 
                 y="Frequency of Result", 
                 x="Percent of Votes Received"
                 ) +
            facet_wrap(~get(keyVar))
        print(p1)
    
        # Create plot for confidence level by final prediction
        p2 <- allPredictions %>%
            filter(finalPrediction==prediction) %>%
            ggplot(aes(x=pctVotes, y=..count../sum(..count..))) + 
            geom_histogram() + 
            labs(title="Percent of Votes Received When Final Prediction Is:", 
                 y="Frequency of Result", 
                 x="Percent of Votes Received"
                 ) +
            facet_wrap(~finalPrediction)
        print(p2)
    }
    
    
    # Create plot of total votes by type
    p3 <- allPredictions %>%
        group_by_at(vars(all_of(c(keyVar, "prediction")))) %>%
        summarize(meanVotes=mean(pctVotes)) %>%
        ggplot(aes_string(x="prediction", y=keyVar)) + 
        geom_tile(aes(fill=meanVotes)) + 
        geom_text(aes(label=paste0(round(100*meanVotes), "%"))) + 
        scale_fill_continuous("% Predicted As", low="white", high="green", limits=c(0, 1)) + 
        scale_x_discrete(position="top") +
        # theme(axis.text.x=element_text(angle=90)) + 
        labs(x="",
             y="Actual Locale", 
             title="Predicted Locale vs. Actual Locale", 
             subtitle="Average votes received by prediction",
             caption=paste0(plotCaption, " as predictors")
             )
    print(p3)
    
    
    # Create plot of accuracy when threshold is met
    if (showAcc) {
        
        # Percent accuracy when threshold is met
        p4 <- allPredictions %>%
            filter(finalPrediction==prediction) %>%
            mutate(correct=(get(keyVar)==finalPrediction), 
                   meetThresh=factor(ifelse(pctVotes >= thresh, "Threshold Met", "Threshold Not Met"), 
                                     levels=c("Threshold Not Met", "Threshold Met")
                                     )
                   ) %>%
            group_by_at(c(keyVar, "meetThresh")) %>%
            summarize(pctCorrect=mean(correct)) %>%
            ggplot(aes_string(x=keyVar)) + 
            geom_point(aes(y=pctCorrect)) + 
            geom_text(aes(y=pctCorrect+0.08, label=paste0(round(100*pctCorrect), "%"))) +
            coord_flip() +
            ylim(c(0, 1.08)) + 
            facet_wrap(~meetThresh) +
            labs(x="Truth", 
                 y="Percent Correctly Classified", 
                 title="Accuracy of Classifying", 
                 subtitle=paste0("Threshold Met means percent of votes is >= ", round(100*thresh), "%"), 
                 caption=paste0(plotCaption, " as predictors")
                 )
        print(p4)
        
    }
    
    
    # Create plot of total votes by threshold
    p6 <- allPredictions %>%
        filter(finalPrediction==prediction) %>%
        mutate(modPredict=ifelse(pctVotes>=thresh, prediction, "Too Low")) %>%
        group_by_at(vars(all_of(c(keyVar, "modPredict")))) %>%
        summarize(n=n()) %>%
        mutate(pct=n/sum(n)) %>%
        ggplot(aes_string(x="modPredict", y=keyVar)) + 
        geom_tile(aes(fill=pct)) + 
        geom_text(aes(label=paste0(round(100*pct), "%"))) + 
        scale_fill_continuous("% Predicted As", low="white", high="lightblue", limits=c(0, 1)) + 
        scale_x_discrete(position="top") +
        # theme(axis.text.x=element_text(angle=90)) + 
        labs(x="",
             y="Actual Locale", 
             title="Predicted Locale vs. Actual Locale", 
             subtitle=paste0("Predictions with <", round(100*thresh), "% of votes classified as 'Too Low'"),
             caption=paste0(plotCaption, " as predictors")
             )
    print(p6)
    
    # Return the accuracy object
    allPredictions
    
}

The assessment can be run for the two main 2015-2017 models:

# Run for the full model including SLP
probs_2015_2017_TDmcwha <- 
    assessPredictionCertainty(rf_types_2015_2017_TDmcwha, 
                              keyVar="city", 
                              plotCaption="Temp, Dew Point, Month/Hour, Clouds, Wind, SLP", 
                              showAcc=TRUE
                              )
## Note: Using an external vector in selections is ambiguous.
## i Use `all_of(keyVar)` instead of `keyVar` to silence this message.
## i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
## Note: Using an external vector in selections is ambiguous.
## i Use `all_of(pkVars)` instead of `pkVars` to silence this message.
## i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.

# Run for the model excluding SLP
probs_2015_2017_TDmcwh <- 
    assessPredictionCertainty(rf_types_2015_2017_TDmcwh, 
                              keyVar="city", 
                              plotCaption="Temp, Dew Point, Month/Hour, Clouds, Wind", 
                              showAcc=TRUE
                              )

Predictions with 80%+ of the votes are made ~80% of the time, and these predictions are ~99% accurate. Predictions with <80% of the votes are made ~20% of the time, and these predictions are ~75% accurate. The percentage of votes received therefore appears to be a reasonable proxy for prediction confidence.
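The threshold behavior described here can be sketched in base R, with toy vote shares standing in for the actual model output:

```r
# Toy winning-vote shares and correctness flags for six predictions (illustrative)
pctVotes <- c(0.95, 0.88, 0.62, 0.85, 0.55, 0.91)
correct  <- c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE)

# Split accuracy by whether the winning class received >= 80% of the votes
meetsThresh <- pctVotes >= 0.8
accuracyBySplit <- tapply(correct, meetsThresh, mean)
accuracyBySplit
```

High-confidence predictions (threshold met) are perfectly accurate in this toy set, while low-confidence ones are right only half the time, mirroring the pattern reported above.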

A similar process can be run for assessing the classification of the other cities against the 2015-2017 data for Chicago, Las Vegas, New Orleans, and San Diego:

useData <- modData %>%
    filter(!(str_sub(source, 1, 4) %in% c("kord", "klas", "kmsy", "ksan")))
    
# Run for the model excluding SLP
probs_allcities_2015_2017_TDmcwh <- 
    assessPredictionCertainty(rf_types_2015_2017_TDmcwh, 
                              testData=useData,
                              keyVar="locale", 
                              plotCaption="Temp, Dew Point, Month/Hour, Clouds, Wind", 
                              showHists=TRUE
                              )
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Patterns in the predictions stand out:

  • Houston is almost always classified as New Orleans when the model is confident (~65% of the time)
  • The cities that are at least as cold as Chicago are almost always classified as Chicago when the model is confident (~70% of the time)
  • Newark, Lincoln, and Indianapolis are generally not classified with confidence (~50% of predictions fall to ‘Too Low’), suggesting these cities may fall in between the four archetypes
  • There are a meaningful number of ‘Too Low’ predictions for every locale, suggesting that many locales “look like” at least two of these archetypes at certain times of the year
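The ‘Too Low’ relabeling used in the p6 plot reduces to a one-line ifelse(); a toy base-R sketch (vote shares and labels are illustrative):

```r
# Toy winning-vote shares and their predicted labels
pctVotes   <- c(0.92, 0.55, 0.81, 0.40)
prediction <- c("Chicago, IL", "Chicago, IL", "New Orleans, LA", "Las Vegas, NV")

# Predictions below the 80% vote threshold are relabeled as "Too Low"
modPredict <- ifelse(pctVotes >= 0.8, prediction, "Too Low")
modPredict
```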